Learning from Data in Radio Algorithm Design

Timothy James O’Shea

Dissertation submitted to the Faculty of the

Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Electrical Engineering

T. Charles Clancy

Robert W. McGwier

Narendran Ramakrishnan

Sanjay Raman

Jeffrey Reed

Oct 26th, 2017

Arlington, Virginia

Keywords: deep learning, radio, physical layer, software radio, machine learning, neural networks, sensing, communications system design, modulation, coding

Copyright 2017, Timothy James O’Shea

Learning from Data in Radio Algorithm Design

Timothy James O’Shea

ABSTRACT

Algorithm design methods for radio communications systems are poised to undergo a massive disruption over the next several years. Today, such algorithms are typically designed manually using compact analytic problem models. However, they are shifting increasingly to machine learning based methods using approximate models with high degrees of freedom, jointly optimized over multiple subsystems, and using real-world data to drive design which may have no simple compact probabilistic analytic form. Over the past five years, this change has already begun occurring at a rapid pace in several fields. Computer vision tasks led deep learning, demonstrating that low level features and entire end-to-end systems could be learned directly from complex imagery datasets, when a powerful collection of optimization methods, regularization methods, architecture strategies, and efficient implementations were used to train large models with high degrees of freedom. Within this work, we demonstrate that this same class of end-to-end deep neural network based learning can be adapted effectively for physical layer radio systems in order to optimize for sensing, estimation, and waveform synthesis systems to achieve state of the art levels of performance in numerous applications. First, we discuss the background and fundamental tools used, then discuss effective strategies and approaches to model design and optimization. Finally, we explore a series of applications across estimation, sensing, and waveform synthesis where we apply this approach to reformulate classical problems and illustrate the value and impact this approach can have on several key radio algorithm design problems.

Learning from Data in Radio Algorithm Design

Timothy James O’Shea

GENERAL AUDIENCE ABSTRACT

Radio communications and sensing systems are used pervasively in everyday life in the modern world to connect phones, computers, smart devices, industrial devices, internet services, space systems, emergency and military users, radar systems, interference monitoring systems, defense electronic systems, and others. Optimizing these systems to function together reliably and efficiently in an ever more complex world is becoming increasingly hard and impractical. Our work introduces a new and radically different method for the design of radio systems by casting them in a new way as artificial intelligence problems, relying on the field of machine learning called deep learning to find and optimize their design. We detail and demonstrate the first such deep learning based communications and sensing systems operating on raw radio signals and quantify their performance when compared to existing methods, showing them to be competitive with and in some cases significantly better performing than state of the art systems today. These ideas, and the evidence of their viability, are central to the emerging field of machine learning communications systems, and will help to make tomorrow’s systems faster, cheaper, more reliable, more adaptive, more efficient, and lower power than currently possible. In a world of ever increasing complexity and connectedness, this new approach to wireless system design from data using machine learning offers a powerful new strategy to improve systems by directly leveraging the complexity in real world data and experience to find efficiencies where current day approaches, with their insufficient, simplified models and design tools, cannot.

Acknowledgments

Thank you to all my current and former colleagues at Virginia Tech, NC State, Bell Labs, the US Government, the GNU Radio Community and industry who supported, critiqued, mentored, collaborated, co-authored and discussed countless ideas surrounding software radio, cognitive radio, and deep learning, especially my advisor Charles Clancy, who has been a constant source of support and inspiration, and has provided me with significant freedom to explore new and disruptive ideas.

I am also very grateful to the individuals and organizations who have supported me and my work throughout my studies including VT, DeepSig, DARPA, NSF, DOD, LM, Hawkeye360, Federated Wireless and others who made much of this possible.

Dedication

This work is dedicated to my family, friends, colleagues, mentors, sponsors and research inspirations, all of whom have supported me and contributed to this work in countless immeasurable ways for which I am extremely grateful.

More abstractly, this work is dedicated to engineering as a creative discipline. While many engineering fields have become complex and tedious, end-to-end learning based approaches to design offer to relieve some of the tedium and slow progress surrounding the field today.

It is my sincere hope that the future of engineering will become more of a creative outlet for experimentalists, contrarians, pragmatists and makers, and that the expansion of machine learning will empower all people to create and to view engineering in a positive, fun, and creative light, as an artform accessible to all rather than as the obscure, slow moving, and specialized field that it can sometimes seem today.

Contents

1 Introduction

1.1 Chasing Optimality in Communication System Design

1.2 Neural Networks in Radio System Design

1.3 Implications, Trends and Challenges in Deep Learning

1.4 Deep Cognitive Radio Systems

2 Background

2.1 Radio Signal Processing

2.1.1 Digital Communications

2.1.2 Radio Channel Models

2.2 Cognitive Radio

2.2.1 Sensing Techniques

2.2.2 Control Modeling

2.3 Deep Learning Models

2.3.1 Error Feedback and Objectives

2.3.2 Network Model Primitives

2.3.3 Regularization

2.3.4 Architectural Strategies

2.3.5 High Performance Computing

2.3.6 Model Search

2.3.7 Model Introspection

3 Learning to Communicate

3.1 The Channel Autoencoder

3.2 Learning to Synchronize with Attention

3.3 Multi-User Interference Channel

3.4 Learning Multi-Antenna Diversity Channels

3.5 Learning MIMO with CSI Feedback

3.6 System Identification Over the Air

4 Learning to Label the Radio Spectrum

4.1 Learning Estimators from Data

4.2 Learning to Identify Modulation Types

4.2.1 Expert Features for Modulation Recognition (Baseline)

4.2.2 Time series Modulation Classification With CNNs

4.2.3 Deep Residual Network Time-series Modulation Classification

4.3 Learning to Identify Radio Protocols

4.4 Learning to Detect Signals

5 Learning Radio Structure

5.1 Unsupervised Structure Learning

5.2 Unsupervised Class Discovery

5.3 Neural Network Model Discovery and Optimization

6 Conclusion

6.1 Publication List

Bibliography

List of acronyms

ACF auto-correlation function

ADC analog-to-digital converter

AE autoencoder

AI artificial intelligence

AM amplitude modulation

ANN artificial neural network

ARP address resolution protocol

AWGN additive white Gaussian noise

BCE binary cross-entropy

BER bit error rate

BLER block error rate

BPSK binary phase shift keying

CAF cross ambiguity function

CCE categorical cross-entropy

CFO carrier frequency offset

CNN convolutional neural network

CQI channel quality information

CR cognitive radio

CSI channel state information

CUDA Compute Unified Device Architecture

DL deep learning

DAC digital to analog converter

DNN deep neural network

DNS domain name server

DOF degrees of freedom

DSA dynamic spectrum access

DSP digital signal processing

DTree decision tree

EM electromagnetic

FEC forward error correction

FFT fast Fourier transform

FLOPS floating point operations per second

FM frequency modulation

FSK frequency shift keying

FV Fisher Vector

GF Galois field

GR GNU Radio

GRU gated recurrent unit

GPGPU general purpose graphic processing unit

GPU graphic processing unit

GMR ground mobile radio

HMM hidden Markov model

HOC higher order cumulants

HOS higher order statistic

HOM higher order moment

I/Q In-phase and Quadrature

ICA independent component analysis

IEEE Institute of Electrical and Electronics Engineers

IID independent and identically distributed

IOU Intersection over union

ISM industrial, scientific, and medical radio

ISI inter-symbol interference

LDPC low density parity check

LO local oscillator

LOS line of sight

LTE long term evolution

LSTM long short-term memory

LTI linear time invariant

MAE mean absolute error

MAP maximum a posteriori

MF matched filter

MFCC Mel-frequency cepstral coefficient

MIMO multiple-input multiple-output

ML machine learning

MLD maximum likelihood

MLE maximum likelihood estimation

MLSP machine learning for signal processing

MMSE minimum mean square error

MNIST Modified National Institute of Standards and Technology

MRSA mean-response scaled initializations

MU multi-user

NNSP neural networks for signal processing

MSE mean squared error

NLP natural language processing

NN neural network

OFDM orthogonal frequency-division multiplexing

OODA observe orient decide act

OTA over-the-air

PAPR peak to average power ratio

PCA principal component analysis

PHY physical layer

PPM parts per million

PPB parts per billion

PSK phase-shift keying

QAM quadrature amplitude modulation

QRNN quasi-recurrent neural network

QoS quality of service

QPSK quadrature phase shift keying

R-CNN region-based convolutional neural network

ReLU rectified linear unit

ResNet residual network

RF radio frequency

RFIC radio frequency integrated circuit

RNN recurrent neural network

ROC receiver operating characteristic

RRC root-raised cosine

RTN radio transformer network

SCF spectral correlation function

SDR software-defined radio

SGD stochastic gradient descent

SELU scaled exponential linear units

SIC successive interference cancellation

SIFT scale-invariant feature transform

SNR signal-to-noise ratio

SRO symbol rate offset

STN spatial transformer network

SoC system-on-chip

STBC space-time block code

SVM support vector machine

t-SNE t-distributed stochastic neighbor embedding

TS time-slotted

USRP universal software radio peripheral

YOLO you only look once

ZF zero forcing

List of Figures

2.1 Direct Conversion Radio Front-End Architecture

2.2 Impulse Response Plots of Varying Delay Spreads

2.3 A single fully connected neuron

2.4 A simple 1D 2-long 2-filter convolutional layer

2.5 A sequence of 2D convolutional layers from AlexNet [1]

2.6 An example dilated convolution structure from WaveNet [2]

2.7 Dropout effect on network connectivity, from [3]

2.8 Example Effect of Dropout on Training and Validation Loss

2.9 A single residual network unit, from [4]

2.10 An exemplary residual network stack, from [4]

2.11 Spatial transformer network structure, from [5]

2.12 Single threading ceiling illustrated, from [6]

2.13 Concurrent GPU vs CPU compute architecture scaling (2017), from [7]

2.14 Evolutionary performance of image classifier search, from [8]

2.15 Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]

2.16 Filter activation visualization in convolutional neural networks (CNNs), from [9]

2.17 Optimization of input images for feature activation, from [10]

2.18 GradCAM Saliency Maps for Dogs and Cats, from [11]

2.19 Information theoretic visualization of deep learning, from [12]

3.1 Illustration of the many modular algorithms present in a modern wireless physical layer modem such as long term evolution (LTE)

3.2 The Fundamental Communications Learning Problem

3.3 A simple autoencoder for a 2D MNIST image, from [14]

3.4 A Simple Channel Autoencoder

3.5 BLER versus Eb/N0 for autoencoder

3.6 BLER versus Eb/N0 for autoencoder

3.7 Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of received symbols

3.8 Learned QAM Modes for Example Mean Power (EMP)

3.9 Learned QAM Modes for Batch Mean Power (BMP)

3.10 Learned QAM Modes for Batch Mean Amplitude (BMA)

3.11 Learned QAM Modes for Batch Mean Max Power (BMMP)

3.12 Learned 4-Symbol QAM Modes using BMA (for 2 bit, 4 bit, and 8 bit)

3.13 Spatial Transformer Example on MNIST Digit from [5]

3.14 Radio Transformer Network Architecture

3.15 Autoencoder training loss with and without RTN

3.16 BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps

3.17 The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages

3.18 Block error rate (BLER) versus Eb/N0 for the two-user interference channel achieved by the autoencoder (AE) and 2^(2k/n)-quadrature amplitude modulation (QAM) time-slotted (TS) for different parameters (n, k)

3.19 Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively

3.20 Open Loop MIMO Channel Autoencoder Architecture

3.21 Alamouti Coding Scheme for 2x1 Open Loop MIMO

3.22 Error Rate Performance of Learned Diversity Scheme

3.23 2x1 MIMO AE, Diagonal H

3.24 2x1 MIMO AE, Random H

3.25 Closed Loop MIMO Learning Autoencoder Architecture

3.26 Error Rate Performance of Learned 2x2 Scheme (Perfect CSI)

3.27 Closed Loop MIMO Autoencoder with Quantized Feedback

3.28 Bit Error Rate Performance of Baseline ZF Method

3.29 Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI

3.30 Learned 2x2 Scheme 1-bit CSI Random Channels

3.31 Learned 2x2 Scheme 1-bit CSI All-Ones Channel

3.32 Learned 2x2 Scheme 2-bit CSI Random Channels

3.33 Learned 2x2 Scheme 2-bit CSI All-Ones Channel

3.34 Deployment Configuration for Quantized MIMO Autoencoder

4.1 CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset

4.2 Timing Estimation MAE Comparison

4.3 Mean CFO Estimation Absolute Error for AWGN Channel

4.4 Mean CFO Estimation Absolute Error (Fading σ=0.5)

4.5 Mean CFO Estimation Absolute Error (Fading σ=1)

4.6 Mean CFO Estimation Absolute Error (Fading σ=2)

4.7 Traditional Approach to Modulation Recognition, from [15]

4.8 10 Modulation CNN performance comparison of accuracy vs signal-to-noise ratio (SNR)

4.9 Confusion matrix of the CNN (SNR = 10 dB)

4.10 System for modulation recognition dataset signal generation and synthetic channel impairment modeling

4.11 Over the air capture system diagram

4.12 Picture of over the air lab capture and training system

4.13 Example graphic of high level feature learning based residual network architecture for modulation recognition

4.14 Complex time domain examples of 24 modulations from the dataset at simulated 10dB Eb/N0 and ℓ = 256

4.15 Complex time domain examples of 24 modulations over the air at high SNR and ℓ = 256

4.16 Complex constellation examples of 24 modulations from the dataset at simulated 10dB Eb/N0 and ℓ = 256

4.17 Complex time domain examples of 24 modulations from the dataset at simulated 0dB Eb/N0 and ℓ = 256

4.18 11-Modulation normal dataset performance comparison (N=1M)

4.19 24-Modulation difficult dataset performance comparison (N=240k)

4.20 Residual unit and residual stack architectures

4.21 Resnet performance under various channel impairments (N=240k)

4.22 Baseline performance under channel impairments (N=240k)

4.23 Comparison models under LO impairment

4.24 ResNet performance vs depth (L = number of residual stacks)

4.25 Modrec performance vs modulation type (Resnet on synthetic data with N=1M, σclk=0.0001)

4.26 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M, additive white Gaussian noise (AWGN), and SNR ≥ 0dB

4.27 Performance vs training set size (N) with ℓ = 1024

4.28 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M and σclk = 0.0001

4.29 Performance vs example length in samples (ℓ)

4.30 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB

4.31 Resnet transfer learning OTA performance

4.32 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)

4.33 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)

4.34 Transfer function of the LSTM unit, from [16]

4.35 Best LSTM256 confusion with RNN length of 512 time-steps

4.36 Detection Algorithm Trade-space Sensitivity vs Specialization

4.37 Computer Vision CNN-based Object Detection Trade Space, from [17]

4.38 Example bounding box detections in computer vision, from [17]

4.39 YOLO style per-grid-cell bounding box regression targets

4.40 Radio bounding box detection examples, from [18]

4.41 Over the air wideband signal bounding box prediction example

5.1 Example Radio Communications Basis Functions

5.2 Convolutional Autoencoder Architecture for Signal Compression

5.3 Convolutional Autoencoder reconstruction of QPSK example 1

5.4 Convolutional Autoencoder reconstruction of QPSK example 2

5.5 AE Encoder Filter Weights

5.6 AE Decoder Filter Weights

5.7 Supervised Embedding Approach

5.8 Unsupervised Embedding Approach

5.9 Supervised Signal Embeddings

5.10 Unsupervised Signal Embeddings

5.11 Compact Model Network Digraph and Hyper-Parameter Search Process

5.12 EvolNN ModRec Net Search Accuracy

5.13 EvolNN MNIST Net Search Accuracy

5.14 EvolNN CFO estimation network search loss

List of Tables

2.1 List of widely used neural network (NN) optimization loss functions

2.2 List of activation functions

3.1 Layout of the autoencoder used in Figs. 3.6 and 3.5. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively

3.2 Candidate channel autoencoder transmit normalization functions

3.3 Layout of the multi-user autoencoder model

4.1 ANN Architecture Used for CFO Estimation

4.2 ANN Architecture Used for Timing Estimation

4.3 Layout for our 10 modulation CNN modulation classifier

4.4 Random Variable Initialization

4.5 Features Used

4.6 CNN Network Layout

4.7 ResNet Network Layout

4.8 Protocol traffic classes considered for classification

4.9 Recurrent network architecture used for network traffic classification

4.10 Performance measurements for RNN protocol classification for varying sequence lengths

4.11 Table input/output shapes

5.1 Final small MNIST search CNN network

5.2 Final Modrec search CNN network

Chapter 1

Introduction

Algorithms in radio signal processing have advanced drastically over the past hundred years. Today’s radio physical layer has evolved to become a complex collection of highly specialized disciplines of research. Forward error correction (FEC), channel state information (CSI) estimation, equalization, multi-carrier modulation, multi-antenna transmission schemes, and numerous other specific areas have each become mature research fields in which many people specialize and achieve small incremental improvements within highly compartmentalized, modular, and specialized subsystems.

Meanwhile, deep learning (DL) [19] has been rapidly disrupting numerous algorithmic information processing fields by re-thinking problems as end-to-end optimization problems rather than as collections of highly specialized hand-tailored subsystem models.

Many problems in wireless communications are ripe for this form of high level rethinking in the context of end-to-end system optimization, and a new class of optimization tools offers the possibility to cope with system complexities and degrees of freedom which were previously intractable for direct complete-system optimization.

Throughout this work, we consider many current models and approaches to wireless communications, the application of neural networks to radio signal processing, recent advances in large scale network optimization behind deep learning, and disruptive applications and ways these techniques can fundamentally change how communications systems are designed.

We motivate treating wireless signal processing problems as data-centric machine learning problems, demonstrating the significant potential of end-to-end learning approaches in contrast to more traditional approaches driven by simplified analytic subsystem models. While some mix of the two is currently the best solution in many cases, much of this work is intended to provide a contrarian perspective to the status quo in the field, comparing quantitative performance with baselines as much as possible, but also attempting to see how far we can go in relying on completely learned systems rather than incremental hybrid approaches.

1.1 Chasing Optimality in Communication System Design

Since the seminal work of Shannon [20] in establishing upper bounds on capacity and performance in communications systems (further detailed in Section 2.1.1), much of the focus of radio communications research has been on trying to achieve this near-optimal level of performance in real world systems.
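As a brief illustration of the kind of bound referred to here, the Shannon capacity of a band-limited additive white Gaussian noise channel is C = B log2(1 + SNR). The following is a minimal sketch, not taken from the original text: the function name and the 1 MHz example bandwidth are assumptions chosen purely for illustration.

```python
import math


def awgn_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C = B * log2(1 + SNR) of an AWGN channel, in bits/s."""
    snr_linear = 10.0 ** (snr_db / 10.0)  # convert dB to a linear power ratio
    return bandwidth_hz * math.log2(1.0 + snr_linear)


# Hypothetical 1 MHz channel, used only to make the numbers concrete.
example_bandwidth_hz = 1e6
for snr_db in (0.0, 10.0, 20.0):
    capacity = awgn_capacity_bps(example_bandwidth_hz, snr_db)
    print(f"SNR = {snr_db:5.1f} dB -> capacity ~ {capacity / 1e6:.2f} Mbit/s")
```

No practical coding and modulation scheme can exceed this rate with arbitrarily low error probability, which is why it serves as the reference point for the techniques discussed next.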

In recent years, techniques such as turbo codes [21], turbo product codes [22], and low density parity check (LDPC) codes [23], along with modulation and transmission techniques such as orthogonal frequency-division multiplexing (OFDM) [24] and multiple-input multiple-output (MIMO), have allowed for performance which comes quite close to this limit. Key enablers for modern FEC codes have been large block sizes combined with probabilistic decoding methods (such as belief propagation) which iteratively compute the most likely codewords based on soft log-likelihoods estimated from received symbols.
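To make the notion of soft log-likelihoods concrete, the sketch below computes the per-bit log-likelihood ratio (LLR) that a soft-input decoder would consume, for the simplest case of BPSK over a real AWGN channel. This is an illustrative assumption rather than a model from the original text: the bit mapping (0 to +1, 1 to -1), the function names, and the noise variance are all hypothetical, and the belief propagation decoder itself is not shown.

```python
import math


def bpsk_llr(received: float, noise_var: float) -> float:
    """LLR = ln[p(y | bit=0) / p(y | bit=1)] for BPSK with 0 -> +1, 1 -> -1.

    For real Gaussian noise of variance noise_var this reduces to 2*y/noise_var.
    """
    return 2.0 * received / noise_var


def gaussian_log_pdf(y: float, mean: float, var: float) -> float:
    """Log of the Gaussian density, used to check the closed form above."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def prob_bit_is_zero(llr: float) -> float:
    """Posterior probability of bit=0 under equal priors, from the LLR."""
    return 1.0 / (1.0 + math.exp(-llr))


noise_var = 0.5  # assumed noise variance, for illustration only
for y in (-1.2, -0.1, 0.3, 0.9):
    llr = bpsk_llr(y, noise_var)
    direct = gaussian_log_pdf(y, 1.0, noise_var) - gaussian_log_pdf(y, -1.0, noise_var)
    print(f"y = {y:+.1f}: LLR = {llr:+.2f} (direct {direct:+.2f}), "
          f"P(bit=0) = {prob_bit_is_zero(llr):.3f}")
```

The sign of the LLR gives the hard decision and its magnitude gives the confidence, which is the information an iterative decoder refines across a large code block.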

Several attempts have been made to extend this maximum likelihood (MLD) block codeword selection task to probabilistically encompass earlier physical layer tasks such as equalization, synchronization and interference cancellation. Approaches in this field include successive interference cancellation (SIC) [25], as well as factor graph/belief propagation models [26, 27]. Both have been shown to be attractive in terms of sensitivity and bit error rate performance under certain harsh conditions, but both run into difficulty in practical use due to computational complexity limitations and the exponential growth in complexity that comes with increasing realism and degrees of freedom (DOF) in closed form analytic channel, emitter, and interferer models.

1.2 Neural Networks in Radio System Design

The use of artificial neural networks in radio signal processing is not a new idea. Significant interest in this area rose and fell in the 80s and 90s. The Institute of Electrical and Electronics Engineers (IEEE) even developed technical committees such as neural networks for signal processing (NNSP), which initially surged in interest, looking at applications of learning to signal processing tasks (and was later renamed machine learning for signal processing (MLSP) when neural networks fell out of favor). Numerous ideas which I revisit in this work were proposed long ago: neuro-evolution was proposed in 1994 [28, 29], neural network based forward error correction code decoding was proposed in 1989 [30], neural network based modulation recognition was proposed in 1985 [31], and many other early works exist which first considered the ways in which neural networks could be applied to difficult regression and classification problems in the radio signal processing domain.

Unfortunately, during this first surge of interest, the optimization algorithms, computational tools, regularization tools, data storage capabilities, data gathering capabilities, and many other requisites for large scale data centric algorithm learning were not yet available to practitioners. Because of this, many people wrote these ideas off as intractable, or too complex to be of practical use, and relied instead on more compact models based on either toy analytic problem representations or max-margin style data reducing optimization techniques (e.g. support vector machine (SVM) [32]). Artificial neural network (ANN) methods were generally regarded as failed and uninteresting for quite some time. Several researchers including, most famously, Hinton, Bengio, LeCun, and Schmidhuber quietly continued heavy research into NN optimization for many years, building and maturing the tools which today allow model and network complexity to scale many orders of magnitude above what was possible at that time.

With the emergence of deep learning in 2012 and the results demonstrating the ability of such techniques to scale, it was clear that any prior assumptions made in the signal processing domain with regard to model performance, complexity, feasibility and practicality needed to be completely re-evaluated in light of modern algorithms and computational capabilities.

In this work, I hope to provide a significant re-consideration of many of the core functions within radio signal processing algorithm design, re-cast fundamental radio signal processing tasks in the context of modern DL optimization tools, capabilities, and constructs, and compare the efficacy of data-centric design of algorithms to the state of the art methods used today, which currently rely much more heavily on complex manual system engineering and algorithm design.

1.3 Implications, Trends and Challenges in Deep Learning

Many of the ideas which constitute DL have been around for quite some time. However, it was not until relatively recently (e.g. AlexNet in 2012 [1]), when many ideas in network architecture, training, regularization, and high performance implementation were combined to great effect, that deep learning really gained widespread attention, adoption, and success. AlexNet was one of the first major efforts and publications which employed these techniques and provided a substantial improvement in machine learning (ML) model performance, in this case on the ImageNet [33] dataset and classification task, reducing the top-1 classification error rate by around 10 percentage points, from 47.1% to 37.5%.

The key breakthrough here was that it was now possible to train very large models with many free parameters using gradient descent on high performance graphic processing unit (GPU) architectures directly from large datasets, with sufficient regularization, using end-to-end learning and low level feature learning, and that these models could outperform previous state of the art systems built on many years of analytic feature engineering and tuning, such as scale-invariant feature transform (SIFT) [34] and Fisher Vectors (FVs) [35]. Since then, this trend of low-level feature learning outperforming hand engineered features has redefined the state of the art in computer vision, and has shown the same capacity to replace low level features such as Mel-frequency cepstral coefficients (MFCCs) in time domain voice processing [36] and equivalents in natural language processing (NLP). This trend towards low level feature learning from raw data is likely to continue to subsume many domains’ existing feature extractors and pre-processors into learned equivalents optimized directly on high level objectives.

Knowledge of domain specific information, however, is not discarded or rendered unnecessary in the frightening way that this statement may suggest. For instance in [36], MFCC filters are replaced with learned features, but the network architecture is set up to allow for time domain convolutional filter learning of quite similarly structured filter taps which happen to fit the real distribution of the human voice spectrum slightly better than a pure dyadic scale. This trend of building NN architectures which combine prior expert domain knowledge with end-to-end learning to find better solutions is bound to continue, and to yield new state of the art results in many domains.

One of the key breakthroughs in understanding deep learning and why it really works was set forth in [37]. Here Dauphin et al. demonstrate that as the number of free parameters in the model goes up, the probability of getting stuck in a non-global local minimum goes down, and an optimizer such as stochastic gradient descent (SGD) is more likely to instead encounter a saddle point, from which optimization can continue rather than terminate. This result underlies why deep learning works, why it can find good solutions, and why large/deep networks which are of higher dimension than the minimum required solution are actually key to the ability to find globally good solutions rapidly. As a result, the field of compressing or pruning these large networks to a smaller minimal subset once trained has also become an important research area which has shown very promising results in reducing computation and network size once a global solution has been found [38, 39].
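The distinction between a local minimum and a saddle point can be seen in a two-parameter toy problem. The sketch below is an illustrative assumption and not the analysis of [37]: it uses the surface f(x, y) = x^2 - y^2, whose gradient vanishes at the origin but whose curvature is negative along y, so plain gradient descent started slightly off the x-axis approaches the saddle and then continues to reduce the loss. The function names, step size, and starting point are hypothetical, and the surface is unbounded below, so it only illustrates the escape behaviour, not convergence.

```python
def loss(x: float, y: float) -> float:
    """Toy surface f(x, y) = x^2 - y^2 with a saddle point at the origin."""
    return x * x - y * y


def gradient(x: float, y: float) -> tuple[float, float]:
    """Analytic gradient of the toy surface: (2x, -2y)."""
    return 2.0 * x, -2.0 * y


def gradient_descent(x: float, y: float, learning_rate: float = 0.1,
                     steps: int = 50) -> tuple[float, float]:
    """Plain gradient descent for a fixed number of steps."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


# Start near the x-axis: x contracts toward the saddle while the tiny y
# component grows along the direction of negative curvature.
x_end, y_end = gradient_descent(1.0, 1e-3)
print(f"start loss = {loss(1.0, 1e-3):.4f}")
print(f"end point  = ({x_end:.2e}, {y_end:.2e}), end loss = {loss(x_end, y_end):.4f}")
```

At a true local minimum every direction has non-negative curvature and no such escape exists; at a saddle at least one direction still leads downhill, which is the sense in which optimization can continue.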

While we have already discussed feature learning as a key trend, end-to-end learning continues to extend the scope of the model which can be trained in an end-to-end fashion. Attention models, or saliency, are key methods by which end-to-end problems may have their learning architecture decomposed into sub-tasks which help to deal with very high dimensional inputs. By focusing attention within a high dimensional space, region proposal networks (Fast R-CNN [40]) and spatial transformer networks [5] both demonstrated key methods in which low complexity front-end networks could direct a small patch of transformed relevant input into a secondary discriminative network, allowing it to operate on the relevant input more effectively and in a canonicalized form with various permutations removed. This design strategy is critical to high dimensional search problems like the Street View House Number recognition task [41], and is extremely applicable to high dimensional radio search spaces, as we will discuss later.

1.4 Deep Cognitive Radio Systems

In this work, we consider how cognitive radio, which has been a slow moving idealized dream over the past handful of years, can be truly realized from a ground-up level of physical layer learning using the new tools for high dimensional model learning which are now available. By combining the end-to-end and feature learning methodologies which have been highly successful in the computer vision domain and other domains with a ground-up approach to radio algorithm learning, the results shown herein demonstrate that a true breakthrough in cognitive radio is finally possible, where we can learn sensing, waveform synthesis, and control behaviors for radio signals which are unconstrained by rigid pre-processors, problem formulations, or other assumptions which were previously necessary in order to make the problem tractable under an older learning regime. For lack of a better expression, we term this use of deep learning as an enabler for realizing cognitive radio capabilities deep cognitive radio. Throughout this document we hope to better explain and quantitatively demonstrate the potential in this concrete realization of this powerful union.

Chapter 2

Background

This work spans three distinct disciplines which are rapidly converging. Digital signal processing (DSP) for radio communications systems provides the core knowledge surrounding analog to digital conversion, the sampling theorem, dynamic range and signal to noise ratio management, and algorithmic knowledge. Cognitive radio (CR) builds on DSP and software radio, applying ideas from artificial intelligence (AI) to help automate and optimize radio applications for specialized objectives. Deep learning (DL) has recently grown as a rapidly accelerating field within AI which relies on large datasets, error feedback, and high level objective functions to guide the formation of very large parametric models, previously intractable with other AI approaches. Before combining and extending these three technologies throughout this dissertation, an overview of key background concepts, models, and approaches is provided within this section. This is particularly important in the radio deep learning field, as most practitioners at this point come either from the radio signal processing field or the machine learning field, and seldom have deep experience in both. This is likely to change quickly over the coming years as the field of radio signal processing adopts these techniques and begins to use data science centric language more accessible to the machine learning community.

2.1 Radio Signal Processing

Figure 2.1: Direct Conversion Radio Front-End Architecture

Radio signal processing has a rich history which can barely be scratched within the scope of this background section. We focus primarily on modern digital radio communications systems and radio sensing systems. In both cases radio front-end hardware typically employs a single or multi-stage oscillator and mixer to convert signals between a specific radio frequency and baseband (or low intermediate frequency) for digitization. Filters are used to reject energy from outside of the desired radio frequency (RF) band, both at RF (RF band pass filters) and as low-pass image-rejection filters at either DC or low IF. Analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) are used to convert analog baseband energy to and from a discretized, sampled baseband representation within the digital side of a radio system.

Today, the vast majority of low cost radio systems leverage this form of direct conversion [42] digital transceiver hardware architecture (shown in figure 2.1) and perform signal specific processing (e.g. detection, modulation or demodulation) on the resulting sampled complex valued quadrature baseband signal digitally in microprocessors (often termed baseband processing units or digital signal processors).

In many mobile devices (e.g. phones, computers, tablets) this whole radio front-end hardware architecture may all be combined into a single system-on-chip (SoC). Most commonly today, the Analog Devices AD9361 [43] is used to perform most of these steps in common lab software-defined radio (SDR) hardware used herein, but numerous similar chips exist from , Samsung, Broadcom, and others.

2.1.1 Digital Communications

Today, the vast majority of radio systems are implemented using digital signal processing, and of those, many of the most ubiquitous which we use every day are digital modulations carrying binary information between computing platforms such as phones, laptops, tablets, cars, base stations, , spacecraft, airplanes, boats, law enforcement radios, and virtually every other platform frequented by mankind. These systems share a number of key properties which must be understood and which will apply to machine learning based radio communications systems as much as they apply to current day systems.

Sampling Theory gives us theoretical bounds on the conversion of information between analog and digital forms. Nyquist observed [44] in 1928 that, to perform undistorted reconstruction of a radio telegraph signal of a given bandwidth (speed of signaling), one must sample the signal at a rate of twice the bandwidth in order to be unambiguous (to avoid aliasing of portions of the signal). This is today known as the Nyquist frequency (or critical sampling frequency) and represents the rate at which any signal must be sampled in order to avoid distortion due to aliasing: two times the highest frequency of the underlying signal. While there does exist an area of investigation in digital communications into compressive sensing which breaks this assumption, virtually all systems we use today sample at or above the Nyquist frequency, and we shall continue to assume a system sampled at or above Nyquist for the purposes of our investigations herein. Compressive sensing based machine learning systems which drop this assumption do pose an interesting prospect, but we do not consider them in our work here.

We also generally do not model the effects of quantization which occur within the sampling process. An analog signal of peak-to-peak voltage $V$ can be divided into $2^N$ discrete voltage levels spanning $[-V/2, V/2]$ when converted to an $N$-bit digital equivalent, with reconstruction error bounded by $\leq \frac{V}{2^{N+1}}$. However, for this to be true, we must assume the signal amplitude is scaled appropriately for the converter's range, and that the dynamic range of the analog signal plus noise can be sufficiently represented within $N$ bits. Many modern ADCs employ 14 or 16 bit conversion, including those used in our measurements (largely universal software radio peripheral (USRP) [45] devices based on the AD9361 [43] chip). These provide sufficient dynamic range for many applications wherein thermal noise $N$ per sample is greater than quantization noise, and we shall make this assumption within our work as well. Most of the work herein is therefore conducted using 32 bit floating point representations for simplicity. This presents more than enough dynamic range for our applications, and can be reduced in precision in future work for many of them.
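As a minimal sketch of the quantization bound above, the following NumPy snippet quantizes a test signal with a uniform $N$-bit quantizer and checks the worst-case reconstruction error against $V/2^{N+1}$. The function name `quantize`, the mid-rise quantizer design, and the test values are assumptions made for illustration only; they are not taken from the measurement hardware used in this work.

```python
import numpy as np

def quantize(x, n_bits, v_pp):
    """Hypothetical uniform quantizer: 2**n_bits levels over [-v_pp/2, v_pp/2]."""
    step = v_pp / 2 ** n_bits
    # Index of the quantization cell each sample falls in, clipped to the converter range.
    idx = np.clip(np.floor((x + v_pp / 2) / step), 0, 2 ** n_bits - 1)
    # Reconstruct each sample at the centre of its cell.
    return -v_pp / 2 + (idx + 0.5) * step

rng = np.random.default_rng(0)
v_pp, n_bits = 2.0, 14                      # illustrative values only
x = rng.uniform(-v_pp / 2, v_pp / 2, size=10_000)
err = np.abs(x - quantize(x, n_bits, v_pp))
bound = v_pp / 2 ** (n_bits + 1)
print(f"max error {err.max():.3e}, bound {bound:.3e}")
```

The bound only holds while the input stays inside the converter range, which mirrors the scaling assumption stated in the text.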

Information Theory can be used to express an upper bound on channel capacity [20]. This defines the maximum information throughput which can traverse a wireless channel, often normalized to bits per second per hertz. Most commonly this is expressed for a single transmitter, single receiver channel where the impairment is given by AWGN, and signal power is expressed as a signal to noise ratio (SNR) relating the signal power to the noise power.

This capacity equation is given traditionally by the following:

$$C = W \log_2\left(1 + \frac{P}{N_0 W}\right) \tag{2.1}$$

Here we obtain a maximum capacity $C$ in bits per second based on the transmit signal power $P$, the noise power spectral density $N_0$, and the bandwidth $W$. This is considered one of the most important bounds in communications, as it characterizes a fundamental limit on how much information we can transmit over a given channel with a specific signal and noise power.
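To make equation 2.1 concrete, here is a small sketch which evaluates the bound for one operating point. The function name `awgn_capacity_bps` and the numeric values (1 MHz of bandwidth at a linear SNR of 15) are illustrative assumptions and do not correspond to any system measured in this work.

```python
import numpy as np

def awgn_capacity_bps(p_watts, n0_watts_per_hz, w_hz):
    """Shannon capacity C = W log2(1 + P / (N0 W)) in bits per second."""
    return w_hz * np.log2(1.0 + p_watts / (n0_watts_per_hz * w_hz))

# Illustrative numbers only: 1 MHz of bandwidth at a linear SNR of 15 (about 11.8 dB).
w_hz = 1e6
n0 = 1e-15
p = 15 * n0 * w_hz
capacity = awgn_capacity_bps(p, n0, w_hz)
spectral_efficiency = capacity / w_hz       # bits per second per hertz
print(f"C = {capacity / 1e6:.2f} Mbit/s, {spectral_efficiency:.2f} bit/s/Hz")
```

Since $\log_2(1 + 15) = 4$, this operating point gives roughly 4 bits per second per hertz, which shows how slowly capacity grows with SNR compared with bandwidth.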

Achieving this bound has driven much of communications research and algorithm iteration over the past 50 years as we seek systems which operate closer and closer to it. Each specific modulation and coding scheme can also be expressed with an expected analytic error bound, given a similar set of operating conditions (SNR, bandwidth), whose information capacity is governed by this bound.

The Shannon bound, however, does not in its common form address multi-user capacity (aggregate bits per second per hertz for all users sharing a common channel) or realistic wireless channel impairments beyond thermal noise (e.g. fading, distortion, or other sources of impairment). Numerous more complex formulations of capacity do exist for more complex modulation and channel models, but no general solution exists for the multi-user, realistically impaired channel and arbitrary emitter modulation case.

2.1.2 Radio Channel Models

Modeling of radio propagation channels is a highly mature field which has developed throughout the history of communications systems. Typical channel models allow us to come up with simplified parametric models which reasonably approximate the effects seen over the wireless channel. High quality Monte Carlo simulation algorithms do exist as well, which can produce realistic sample by sample distributions of impairment models for simulation [46], but these often do not have a compact form.

These channel models may be used to perform analytic optimization, as they have been in many instances, to simulate the transmission and reception of a wireless signal in a Monte Carlo sense, or, as discussed later, directly in the development of domain specific attention models or simulation models within end-to-end radio system optimization processes.

Thermal Noise is a key physical limitation in analog to digital conversion which limits sensitivity and achievable signal to noise ratios for any given received signal power [47] and bandwidth. We can model the absolute thermal noise power as $P = kTB$, where $P$ is the power in watts, $k$ is Boltzmann's constant ($1.38 \times 10^{-23}$ Joules/Kelvin), $T$ is the temperature at the ADC in Kelvin, and $B$ is the conversion bandwidth (sample rate) in Hz. Given this fundamental bound on SNR for specific receiver powers due to device physics, all radio systems must function within the finite SNR margin governed by this limit. For simulation and analysis, this is modeled accurately as an additive white Gaussian noise (AWGN) process, where received samples ($r$) are each the sum of some transmitted signal ($s$) and a noise component ($N$). This can be expressed as $r = s + N$, where $N = N_{thermal} \sim \mathcal{N}(0, \sigma_N)$ and $|s|^2/\sigma_N^2$ expresses the SNR. This is typically expressed in dB as $10\log_{10}(|s|^2/\sigma_N^2)$ rather than as a ratio.
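The two relations above, $P = kTB$ and the AWGN model $r = s + N$, can be sketched as follows. The helper names `thermal_noise_dbm` and `add_awgn`, the 290 K reference temperature, and the unit-power test signal are assumptions chosen for illustration, not parameters from the experiments in this work.

```python
import numpy as np

BOLTZMANN = 1.380649e-23  # J/K

def thermal_noise_dbm(temp_kelvin, bandwidth_hz):
    """Thermal noise power P = kTB, reported in dBm."""
    p_watts = BOLTZMANN * temp_kelvin * bandwidth_hz
    return 10 * np.log10(p_watts / 1e-3)

def add_awgn(s, snr_db, rng):
    """Add complex white Gaussian noise so that |s|^2 / sigma_N^2 matches snr_db."""
    sig_power = np.mean(np.abs(s) ** 2)
    noise_power = sig_power / 10 ** (snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(s.shape) + 1j * rng.standard_normal(s.shape))
    return s + noise

rng = np.random.default_rng(0)
s = np.exp(1j * 2 * np.pi * rng.random(100_000))    # unit-power test signal (assumed)
r = add_awgn(s, snr_db=10.0, rng=rng)
measured_snr_db = 10 * np.log10(np.mean(np.abs(s) ** 2) / np.mean(np.abs(r - s) ** 2))
noise_floor_dbm = thermal_noise_dbm(290.0, 1e6)     # 290 K and 1 MHz are illustrative
print(f"measured SNR {measured_snr_db:.2f} dB, noise floor {noise_floor_dbm:.1f} dBm")
```

At 290 K this gives a floor near -114 dBm per MHz of bandwidth, the familiar rule of thumb for receiver sensitivity budgets.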

Delay spread occurs during wireless propagation when multiple copies of a transmitted signal are received at differing delays and phase offsets. This is commonly due to multi-path fading, or the summation of many different propagation paths, either direct, reflective or otherwise, arriving and summing together additively at a receive antenna. For an impulsive channel, shown in figure 2.2 for σ = 0, all energy arrives at a single time-delay in the impulse response.

Figure 2.2: Impulse Response Plots of Varying Delay Spreads

For a frequency-selective fading channel (with non-zero delay spread), this energy arrives at some combination of random time intervals which combine additively at the receive antenna. This results in an impulse response which contains power at a range of different frequencies, often following a distribution such as Rayleigh (where there is no dominant mode or line-of-sight) or Rician (when a large line-of-sight component is present). Figure 2.2 shows fading channels for σ = 0.5, 1.0, 2.0, which we will refer to as light, medium and harsh fading conditions later on. Typically this impulse response is considered stationary for the 'coherence time' of the channel, and this is the assumption we will be making in many of our experiments later where, for instance, we convolve a single example of 1024 time-samples with a single random impulse response as a simplifying assumption. This sort of modeling is used routinely in communications systems today.
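The simplifying assumption described above, one random impulse response held fixed over a 1024-sample example, might be sketched as below. The text does not specify how σ parameterizes the impulse response, so the exponentially decaying power delay profile, the Rayleigh (complex Gaussian) taps, the tap count, and the function name `random_impulse_response` are all assumptions made for this illustration.

```python
import numpy as np

def random_impulse_response(sigma, n_taps, rng):
    """Hypothetical Rayleigh tap model with an assumed exponential delay profile."""
    if sigma == 0:
        h = np.zeros(n_taps, dtype=complex)
        h[0] = 1.0                      # impulsive channel: all energy at one delay
        return h
    profile = np.exp(-np.arange(n_taps) / sigma)
    taps = (rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)) / np.sqrt(2)
    h = np.sqrt(profile) * taps
    return h / np.linalg.norm(h)        # normalize to unit energy

rng = np.random.default_rng(1)
# One 1024-sample example of random QPSK symbols (assumed test signal).
s = np.exp(1j * (2 * np.pi * rng.integers(0, 4, 1024) / 4 + np.pi / 4))
h = random_impulse_response(1.0, 8, rng)
# The impulse response is held constant over the example (coherence-time assumption).
r = np.convolve(s, h)[:len(s)]
```

Drawing a fresh `h` per example while holding it fixed within the example is one simple way to realize the stationarity assumption in a simulated dataset.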

Timing Offsets are also present in all wireless systems, where path lengths and propagation times can change based on radio mobility or changing path lengths due to reflection, refraction, dispersion, etc. This is of course governed by the propagation of radio-frequency waves at the speed of light ($c = 3 \times 10^8$ m/s) over some distance ($d$), where the time delay ($\tau_0$) is given by $d/c$. This time delay $\tau_0$ can be treated as a random process for simulation purposes, and is typically estimated in a radio receiver through the process of synchronization. Most commonly, a filter matched to some set of reference tones at the beginning of a transmission allows for time of arrival estimation and the extraction of a received signal from the beginning of a single transmission.
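A minimal sketch of both ideas follows: the propagation delay $\tau_0 = d/c$, and a matched-filter style time of arrival estimate. The reference sequence here is a random QPSK burst standing in for the reference tones mentioned above; that choice, the noise level, the buffer sizes, and the function names are assumptions for illustration.

```python
import numpy as np

C_LIGHT = 3e8  # m/s, as used in the text

def propagation_delay(distance_m):
    """Time delay tau_0 = d / c in seconds."""
    return distance_m / C_LIGHT

def estimate_arrival(r, reference):
    """Return the sample offset where the reference best matches the received buffer."""
    # np.correlate conjugates its second argument, so this acts as a matched filter.
    corr = np.correlate(r, reference, mode="valid")
    return int(np.argmax(np.abs(corr)))

rng = np.random.default_rng(2)
# Assumed 64-symbol known reference sequence (random QPSK).
reference = np.exp(1j * (2 * np.pi * rng.integers(0, 4, 64) / 4 + np.pi / 4))
true_offset = 100
# Received buffer: low-level noise with the reference embedded at true_offset.
r = 0.1 * (rng.standard_normal(512) + 1j * rng.standard_normal(512))
r[true_offset:true_offset + 64] += reference
estimated_offset = estimate_arrival(r, reference)
tau0 = propagation_delay(3000.0)    # a 3 km path, illustrative only
print(estimated_offset, tau0)
```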

Clock Offsets occur because physically separated radios (e.g. a base station and a handset) typically have separate free running clocks from which the digital sampling rate and center-frequency tuning oscillator signals are derived. Free running clock rates can be treated as a Gaussian random walk process, where they are stable on short time-intervals and stability decreases when looking at larger time intervals. This is typically characterized in the hardware specification of a given device in terms of expected clock error in parts per million (PPM) or parts per billion (PPB). For short time-intervals we can make the assumption of stationarity and assign a fixed estimate for symbol rate offset (SRO) and carrier frequency offset (CFO) between a transmitter and a receiver, or between a received signal and a receiver. Motion of transmitters, receivers, or reflectors can additionally introduce SRO or CFO through the Doppler effect, where the CFO due to motion ($\Delta f_{doppler}$) is given by $\Delta f_{doppler} = \frac{\Delta v}{c} F_c$, where $\Delta v$ is the difference in the velocity of the transmitter and receiver along the path of transmission, and $F_c$ is the center frequency of the signal emitter. Generally the CFO and SRO incident on a wireless receiver are a combination of offset due to Doppler and offset due to random clock offsets between hardware devices.

In much of our work, focused on short-time examples, we assume coherence of sample rate and center frequency over a small number of samples in one example, and randomly draw CFO and SRO from a normal distribution. In the case of CFO, we assume a carrier frequency distribution $\Delta F_c \sim \mathcal{N}(0, \sigma_{CFO})$, and in the case of SRO, we assume a small resampling ratio near one, $\Delta R \sim \mathcal{N}(1, \sigma_{SRO})$.
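The Doppler relation and the per-example random draws described above can be sketched as follows. The values of `sigma_cfo_hz` and `sigma_sro`, the example velocities and frequencies, and the helper names are hypothetical placeholders, not the settings used in the experiments.

```python
import numpy as np

C_LIGHT = 3e8  # m/s

def doppler_cfo_hz(delta_v_mps, fc_hz):
    """CFO due to motion: delta_f = (delta_v / c) * Fc."""
    return delta_v_mps / C_LIGHT * fc_hz

def clock_cfo_hz(ppm, fc_hz):
    """CFO due to an oscillator error expressed in parts per million."""
    return ppm * 1e-6 * fc_hz

rng = np.random.default_rng(3)
sigma_cfo_hz = 500.0    # hypothetical standard deviation of the CFO draw
sigma_sro = 1e-5        # hypothetical standard deviation of the resampling ratio
n_examples = 4
# One fixed CFO and SRO per short example (stationarity assumption).
delta_fc = rng.normal(0.0, sigma_cfo_hz, size=n_examples)
delta_r = rng.normal(1.0, sigma_sro, size=n_examples)
print(doppler_cfo_hz(30.0, 2.4e9), clock_cfo_hz(1.0, 2.4e9))
```

For scale, a 30 m/s closing speed at 2.4 GHz contributes about 240 Hz, while a 1 PPM oscillator error at the same carrier contributes about 2.4 kHz, so clock error often dominates for slow-moving radios.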

Aggregate Effects are present in any real system, where all of the above uncertainties about a channel are combined into a single simplified wireless propagation model. We can express this as a transmitted signal, $s(t)$, perturbed by a number of channel effects over the air before being received as $r(t)$ at the receiver. These effects include time delay, time dilation, complex carrier phase rotation, carrier frequency offset, additive thermal noise, and convolution of the signal with the channel impulse response, all of which are random time-varying processes. A closed form of the analytic transform between the time varying signals $s(t)$ and $r(t)$ including each of these effects can be approximated as shown in the equation below.

$$r(t) = e^{j 2\pi \Delta F_c(t)/F_s} \int_{\tau=\tau_0}^{\tau_0+T} s\big(t - \tau \Delta R(t)\big)\, h(\tau)\, d\tau + n_{thermal}(t) \tag{2.2}$$

Unfortunately, such an expression is quite unwieldy when performing analytic optimization of estimators in closed form, since it involves interpolation with a time-varying delay function and integration with a time-varying impulse response. To simplify this, in many cases, the simplified expression below is used.

$$r(t) = s(t - \tau_0) + N_{thermal} \tag{2.3}$$

When considering time and frequency offsets, a slightly more involved expression is also commonly used.

$$r(t) = e^{j 2\pi \Delta F_c t} s(t - \tau_0) + N_{thermal} \tag{2.4}$$
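A discrete-time sketch of the simplified model in equation 2.4 is shown below. It assumes an integer sample delay, a fixed CFO over the example, and a sample rate `fs_hz` used to normalize the sample index to seconds; the function name `simple_channel` and all numeric values are illustrative assumptions.

```python
import numpy as np

def simple_channel(s, delta_fc_hz, fs_hz, tau0_samples, sigma_n, rng):
    """r[n] = exp(j 2 pi dFc n / Fs) * s[n - tau0] + noise, with an integer delay."""
    n = np.arange(len(s))
    # Delay by prepending zeros and truncating to the original length.
    delayed = np.concatenate([np.zeros(tau0_samples, dtype=complex), s])[:len(s)]
    rotation = np.exp(1j * 2 * np.pi * delta_fc_hz * n / fs_hz)
    noise = sigma_n / np.sqrt(2) * (
        rng.standard_normal(len(s)) + 1j * rng.standard_normal(len(s)))
    return rotation * delayed + noise

rng = np.random.default_rng(4)
s = np.exp(1j * (2 * np.pi * rng.integers(0, 4, 1024) / 4 + np.pi / 4))
fs_hz, delta_fc_hz, tau0_samples = 1e6, 250.0, 7     # illustrative values
r = simple_channel(s, delta_fc_hz, fs_hz, tau0_samples, sigma_n=0.0, rng=rng)

# With the offsets known and no noise, the channel can be undone exactly.
n = np.arange(len(s))
recovered = (r * np.exp(-1j * 2 * np.pi * delta_fc_hz * n / fs_hz))[tau0_samples:]
```

In practice the offsets are unknown, and estimating them is exactly the synchronization task that the later estimation work addresses.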

Many estimators focus on the structure of $s(t)$, which contains well formed structures such as the following for quadrature phase shift keying (QPSK) when considering perfectly sampled symbol periods.

$$s(t) = e^{j(2\pi N/4 + \pi/4)}, \quad N \in \{0, 1, 2, 3\} \tag{2.5}$$

Such structured forms of $s(t)$ and simplified AWGN-only propagation models are key to the clever derivation of estimators today which are specialized to specific forms of $s(t)$ (e.g. realizing that $s(t)^4$ falls on a single point for all $N$). However, once the more complex, nonlinear, and cumbersome analytic channel model is introduced, and/or many different $s(t)$ transmitted signal structures need to be considered, this kind of manual analytic trick begins to break down quite rapidly and requires practical model simplifications to remain tractable.
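The fourth-power observation can be checked directly from equation 2.5: raising each QPSK symbol to the fourth power gives $e^{j(2\pi N + \pi)} = -1$ regardless of $N$. The short sketch below verifies this numerically; variable names are illustrative.

```python
import numpy as np

# The four QPSK constellation points from equation 2.5.
n = np.arange(4)
s = np.exp(1j * (2 * np.pi * n / 4 + np.pi / 4))

# Raising to the fourth power removes the data-dependent phase.
fourth_power = s ** 4
print(np.round(fourth_power, 6))
```

Because the modulation is stripped, any residual rotation of $s(t)^4$ over time reflects carrier offset rather than data, which is what makes this trick useful for estimators specialized to QPSK.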

2.2 Cognitive Radio

Cognitive radio [48, 31] is a field which explores the potential ways in which our radio and mobile devices can behave in much smarter and more efficient ways by leveraging artificial intelligence to make better, more informed decisions and employ improved control systems and channel access schemes.

Common examples of this include radios saving power by intelligently searching for towers based on expected locations and distributions or historical information, conducting hand-off more intelligently, managing finite resources (typically power and spectrum) efficiently, and tuning RF communications systems and front-end parameters such as gain, filtering, tuning or otherwise in order to improve radio performance [49].

Perhaps the most widely published application of cognitive radio is that of dynamic spectrum access (DSA) [50, 51], which seeks to increase spectrum usage and efficiency by allowing for much more dynamic spectrum sharing by secondary users through intelligent sensing, radio user identification, and non-invasive access strategies designed not to harm primary spectrum users, but to use the vacancies and holes available in frequency and time.

Unfortunately, many of the techniques investigated within the first surge of interest in cognitive radio and dynamic spectrum access (before the first cognitive radio winter) attempted to solve very specific sensing or control system problems through a process of specialized modeling of specific scenario features, processes, and distributions (often only for one specific primary user or frequency band). This resulted in a number of potential end solutions, for instance inter-operability optimizations specifically with TV broadcast signals in TV broadcast bands, or control protocols to maximize fairness among sharing secondary access nodes. By and large, however, it did not provide a general solution which allowed us to generalize spectrum sensing, spectrum access, and control optimization widely for many different scenarios, emitters, and bands. Due to this narrow applicability, and to slow moving spectrum policy which has refused to allow for sensing based secondary spectrum access, much of the research in this field yielded relatively narrow interest and effected relatively minimal change in radio system design and deployment as a whole.

2.2.1 Sensing Techniques

One of the earliest applications of artificial intelligence in radio systems was that of spectrum sensing for emitter identification. This is often a multi-stage expert system which first performs a form of wideband energy detection, often by identifying concentrated energy within the power spectrum density, localizing and extracting carriers, and then further characterizing these carriers through an iterative process of carrier estimation and classification.

There have been numerous attempts to use neural network based approximations, especially in the latter stage of signal classification (single signal identification on top of expert feature sets), but many of them have relied on preprocessed feature spaces as input, such as the spectral correlation function (SCF) [52], to provide a relatively simple neural network mapping task. The scope of previous expert sensing methods is quite large, and we explore part of it in more depth in the later sensing section.

2.2.2 Control Modeling

Control system modeling in radio systems is another interesting task which was addressed within the scope of cognitive radio problems and publications. Control optimization approaches have been applied to many tasks such as channel frequency selection in dynamic spectrum access systems and avoidance of malicious users such as following tone jammers. Two of the most commonly considered approaches are modeling access opportunities of whitespace as a hidden Markov model (HMM) [53] and modeling collective control problems as game theoretic problems [54]. Unfortunately, each of these models and solutions is quite highly specialized for the specific scenario, band, and primary user considered in many of these works.

Works also considered the effects of optimal radio mode and tuning control using a variety of methods [55], including the use of expert planning approaches [56] such as the popular observe orient decide act (OODA)-loop concept. However, these two approaches are also quite reliant on expert knowledge, modeling, descriptions, and specific scenario-centric learning. We hope that with the methods presented here we can begin to devise and build solutions to these classes of problems which generalize much better without significant expert model construction and manual adaptation needed.

2.3 Deep Learning Models

The study of deep learning has recently brought together a collection of powerful optimization tools, network architectural tools, regularization knowledge, high performance implementations, and other techniques which can be used to learn powerful models from datasets and simulators. Here, we highlight a number of key ideas and enablers in greater depth for background. These will be applied in later sections to several core problems in radio signal processing.

2.3.1 Error Feedback and Objectives

At its core, deep learning as it exists today is focused on the optimization of large parametric network models which can accommodate very high degrees of freedom, non-linear transformations, and deep hierarchical structure.

Table 2.1: List of widely used NN optimization loss functions

| Name | $\mathcal{L}(y, \hat{y})$ |
|---|---|
| Mean Squared Error (MSE) | $\lVert y - \hat{y} \rVert^2$ |
| Mean Absolute Error (MAE) | $\lvert y - \hat{y} \rvert$ |
| Binary cross-entropy (BCE) | $-y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})$ |
| Categorical cross-entropy (CCE) | $-\frac{1}{N}\sum_{i=0}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$ |
| Log-cosh | $\frac{1}{N}\sum_{i=0}^{N}\log\left(\cosh\left(y_i - \hat{y}_i\right)\right)$ |
| Huber | $\frac{1}{N}\sum_{i=0}^{N}\begin{cases}\frac{1}{2}(y_i - \hat{y}_i)^2 & \mathrm{abs}(y_i - \hat{y}_i) < 1 \\ (y_i - \hat{y}_i) & \mathrm{abs}(y_i - \hat{y}_i) \geq 1\end{cases}$ |

Today, such networks define one or more loss functions ($\mathcal{L}$) between network output values ($\hat{y}$) and target network output values ($y$) (where $y_i$ denotes the $i$'th output value), and use a form of global error feedback from this loss function in order to train network parameters (also referred to as learning). Artificial neural networks (ANNs or just NNs) have long relied on back-propagation [57] of error gradients to fit the parameters in their networks. In its simplest form, the iterative weight update process of back-propagation for some function $\hat{y} = f(x, \theta)$ is given by the following simple weight update equation with a learning rate ($\eta$).

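$$\theta_{n+1} = \theta_n - \eta \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} = \theta_n - \eta \frac{\partial \mathcal{L}(y, f(x, \theta))}{\partial \theta} \tag{2.6}$$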

This gradient can be derived in an automated fashion using automatic differentiation, even for very complex functions representing entire networks. This is a key enabler for the flexibility and rapid speed at which deep learning architectures are able to evolve today. One SGD weight update evaluation of $\frac{\partial \mathcal{L}(y, f(x, \theta))}{\partial \theta}$ is often referred to as a backwards pass, while network evaluation of $\hat{y} = f(x, \theta)$ is often referred to as a forwards pass.
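As a minimal sketch of the update in equation 2.6, the snippet below fits a linear model $f(x, \theta) = x\theta$ under an MSE loss using mini-batch SGD. The gradient is written out by hand here rather than obtained by automatic differentiation, and the model, data, batch size, and learning rate are all assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.5, -2.0, 0.5])     # parameters the optimizer should recover
x = rng.standard_normal((256, 3))
y = x @ theta_true                          # noise-free targets

def mse_gradient(theta, x_batch, y_batch):
    """Backwards pass: gradient of mean((y - x theta)^2) with respect to theta."""
    y_hat = x_batch @ theta                 # forwards pass
    return -2.0 / len(y_batch) * x_batch.T @ (y_batch - y_hat)

theta = np.zeros(3)
eta = 0.05                                  # learning rate
for step in range(500):
    batch = rng.choice(len(x), size=32, replace=False)
    # Equation 2.6: theta_{n+1} = theta_n - eta * dL/dtheta
    theta = theta - eta * mse_gradient(theta, x[batch], y[batch])
print(theta)
```

Each loop iteration performs one forwards pass and one backwards pass on a random mini-batch, which is the "stochastic" part of SGD.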

This form of iterative weight update through SGD with global error feedback through back-propagation is used today in virtually all DL model training applications. A wide variety of loss functions are used for different applications; many of the most commonly used, including mean squared error (MSE) and categorical cross-entropy (CCE), are shown in table 2.1. MSE is commonly used for real-valued regression problems, while CCE is typically used for classification problems. In classification with a CCE loss function, a so called "one-hot" encoding is typically used, where the output targets ($y_i$) take the form of a zero vector with a one at the index of the correct class label. In this case output predictions $\hat{y}_i$ for each class $i$ of $N$ fall on the range $(0, 1)$, which can be enforced with an output activation function with a bounded $(0, 1)$ output range such as sigmoid or softmax (softmax is typically used). When bounded in this way, these output predictions are often referred to as pseudo-probabilities, since they are trained to predict the discrete target probabilities $p(y_i = 1)$ or $p(y_i = 0)$ for each output index.
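The one-hot target, softmax output, and cross-entropy loss described above can be sketched as follows. Note that this uses the common single-label form $-\sum_i y_i \log(\hat{y}_i)$, which is an assumption made for this illustration and differs slightly from the averaged form listed in the loss table; the logit values are arbitrary.

```python
import numpy as np

def softmax(z):
    """Map raw network outputs to pseudo-probabilities on (0, 1) that sum to one."""
    e = np.exp(z - np.max(z))           # subtract the max for numerical stability
    return e / e.sum()

def one_hot(label, n_classes):
    """Zero vector with a one at the index of the correct class."""
    y = np.zeros(n_classes)
    y[label] = 1.0
    return y

def categorical_cross_entropy(y, y_hat):
    """Single-label cross-entropy: only the correct class term survives."""
    return -np.sum(y * np.log(y_hat))

logits = np.array([2.0, 0.5, -1.0, 0.1])    # arbitrary example network outputs
y_hat = softmax(logits)
y = one_hot(0, 4)
loss = categorical_cross_entropy(y, y_hat)
print(y_hat, loss)
```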

SGD has improved drastically since the basic formulation shown in equation 2.6. Momentum [58, 59] is an important enhancement on the simple formulation of SGD shown above. With momentum, the effective step size adapts dynamically based on the stability of the gradient in each direction, to prevent oscillation and to accelerate descent across large nearly flat regions. The simple form of the gradient update expression with momentum is given in equation 2.7, where velocity $v$ is now updated iteratively (with momentum coefficient γ and step size α) and used to derive new weights $\theta$.

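$$\begin{aligned} v_{n+1} &= \gamma v_n + \alpha \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \\ \theta_{n+1} &= \theta_n - v_{n+1} \end{aligned} \tag{2.7}$$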

This approach was accelerated further using Nesterov's approach [60], which improves momentum updates assuming the target loss manifold is a smooth function. Within the past handful of years, both RMSProp [61] and Adam [62], which incorporate gradient normalization into their momentum updates, have become widely used. In Adam, which is used in the vast majority of the work included herein, the update equation is given in equation 2.8.

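$$\begin{aligned} m_{n+1} &= \beta_1 m_n + (1 - \beta_1) \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \\ v_{n+1} &= \beta_2 v_n + (1 - \beta_2) \left(\frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta}\right)^2 \\ \hat{m}_{n+1} &= \frac{m_{n+1}}{1 - \beta_1^{n+1}} \\ \hat{v}_{n+1} &= \frac{v_{n+1}}{1 - \beta_2^{n+1}} \\ \theta_{n+1} &= \theta_n - \frac{\eta\, \hat{m}_{n+1}}{\sqrt{\hat{v}_{n+1}} + \epsilon} \end{aligned} \tag{2.8}$$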
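The momentum and Adam updates can be sketched side by side on a toy problem. The quadratic loss, the starting point, and the hyper-parameter values below are assumptions chosen for illustration; gradients are taken with respect to the parameters $\theta$, and this is a bare NumPy sketch rather than the optimized implementations used for the work herein.

```python
import numpy as np

A = np.array([1.0, 10.0])               # curvature of an assumed toy quadratic loss

def loss(theta):
    return 0.5 * np.sum(A * theta ** 2)

def grad(theta):
    return A * theta

def momentum_sgd(theta, steps, alpha=0.05, gamma=0.9):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + alpha * grad(theta)     # velocity update
        theta = theta - v                       # weight update
    return theta

def adam(theta, steps, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for n in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g         # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2    # second moment estimate
        m_hat = m / (1 - beta1 ** n)            # bias correction
        v_hat = v / (1 - beta2 ** n)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta0 = np.array([1.0, -1.0])
theta_momentum = momentum_sgd(theta0, steps=500)
theta_adam = adam(theta0, steps=500)
print(loss(theta0), loss(theta_momentum), loss(theta_adam))
```

The division by $\sqrt{\hat{v}}$ makes Adam's step size largely insensitive to the scale of each gradient coordinate, which is why both coordinates of this badly conditioned toy problem progress at a similar pace.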

Even more recently, the problem of learning rate control during SGD has been readdressed in novel ways which provide faster optimization (often at the cost of increased computational complexity per iteration). These include the use of curvature and gradient variance in a closed loop system [63], as well as casting the learning rate tuning problem as a separate reinforcement learning problem [64] (e.g. learning to learn faster).

These methods have shown promising results, but are not in wide-spread use at this time and appear to provide relatively incremental performance improvements in our limited experimentation.

There has been significant discussion lately surrounding whether global error feedback is really appropriate, optimal or biologically plausible within the human brain. The notion of a global loss function and global error feedback both seem unlikely in the human mind.

More plausible formulations generally include a more localized form of loss computation and a more localized and distributed form of error feedback. Numerous ideas on improved optimization are currently under development, and will almost certainly provide improvements in network training within the coming years. Key explorations in this field include Feedback Alignment [65], Equilibrium Propagation [66], Inverse Autoregressive Flow [67], and others. This is a very active area of research, and a challenging field.

Most of the work herein relies on mature global back-propagation using forms of SGD for network optimization due to their maturity, effectiveness (current state of the art on most tasks) and the availability of optimized implementations. However, given the promising nature of emerging basic research into distributed and local-feedback optimization methods (which attempt to mirror more closely what the human brain is believed to do, i.e. no single global loss function or global clock synchronization) and the similarity of the network functions on which they may operate, we expect many of these methods will be readily applicable to lend further improvements to much of the work shown here.

2.3.2 Network Model Primitives

Neural network architectures have come quite a long way. From the early use of a very small number of 'perceptrons', the formulation of a feed-forward memoryless single neuron has been relatively straightforward, and is given by equation 2.9. Here a set of input values $X$ of size $(1, N)$ is concatenated with a ones vector (to include a bias term) of size $(1, 1)$ to form $X_1$, and multiplied with a weight matrix $W$ of size $((N+1), M)$ to produce an output vector $H$ of size $(M, 1)$.

$$Y = f(W \times X_1) \tag{2.9}$$

An output value $Y$ is then produced using some activation function $f$, which may be linear (e.g. the identity function) or may be a non-linear function such as a sigmoid or rectified linear unit. Commonly used activation functions, $f$, are given in table 2.2. Sigmoid activation functions have a long and rich history in the literature, but today a number of different activations are used. The simple rectified linear unit (ReLU) activation [68] has been used increasingly in recent times instead of the sigmoid due to a number of important properties. It is computationally much cheaper to compute, as is its gradient, and training typically converges much faster than when using smooth sigmoid or tanh activations, which suffer more from the vanishing gradient problem [69] (e.g. when using back-prop through many layers), where gradient contributions to loss can differ by orders of magnitude between subsequent layers, making optimization very slow.
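The single-layer computation in equation 2.9 can be sketched as below. The snippet assumes a row-vector convention, so the output comes out with shape $(1, M)$ rather than $(M, 1)$; the layer sizes, the random weights, and the function name `dense_forward` are illustrative assumptions.

```python
import numpy as np

def dense_forward(x, w, activation):
    """Append a bias input of one to x, multiply by the weights, apply the activation."""
    x1 = np.concatenate([x, np.ones((x.shape[0], 1))], axis=1)   # shape (1, N + 1)
    h = x1 @ w                                                   # shape (1, M)
    return activation(h)

rng = np.random.default_rng(4)
n_in, n_out = 8, 3                          # illustrative layer sizes
x = rng.standard_normal((1, n_in))
w = rng.standard_normal((n_in + 1, n_out))  # last row acts as the bias term
y = dense_forward(x, w, lambda z: np.maximum(z, 0.0))   # ReLU activation
print(y.shape, w.size)
```

Folding the bias into the weight matrix this way is what lets the whole layer be expressed as one matrix multiplication.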

Table 2.2: List of activation functions

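| Name | Function $f(x)$ | Range |
|---|---|---|
| Linear | $x$ | $(-\infty, \infty)$ |
| ReLU [68] | $\max(0, x_i)$ | $[0, \infty)$ |
| Leaky ReLU [70] | $\begin{cases}\alpha x, & \text{for } x < 0 \\ x, & \text{for } x \geq 0\end{cases}$ | $(-\infty, \infty)$ |
| TanH | $\tanh(x_i)$ | $(-1, 1)$ |
| ArcTan | $\tan^{-1}(x)$ | $(-\pi/2, \pi/2)$ |
| Sigmoid | $\frac{1}{1 + e^{-x}}$ | $(0, 1)$ |
| SoftMax [71] | $\frac{e^{x_i}}{\sum_{j=0}^{N} e^{x_j}}$ | $(0, 1)$ |
| Step | $\begin{cases}0, & \text{for } x < 0 \\ 1, & \text{for } x \geq 0\end{cases}$ | $[0, 1]$ |
| ELU [72] | $\begin{cases}x, & \text{for } x > 0 \\ \alpha e^{x} - \alpha, & \text{for } x \leq 0\end{cases}$ | $(-\alpha, \infty)$ |
| SELU [73] | $\lambda \begin{cases}x, & \text{for } x > 0 \\ \alpha e^{x} - \alpha, & \text{for } x \leq 0\end{cases}$ | $(-\lambda\alpha, \infty)$ |
| SoftPlus [74] | $\ln(1 + e^{x})$ | $(0, \infty)$ |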

Each of these activations is expressed compactly in table 2.2. In the case of tanh, sigmoid, and softmax, exponentiation operations are used for forward passes, while in ReLU units, a simple piecewise linear transfer is incredibly cheap to compute. In the table, α denotes some leaky (non-activated) coefficient, while λ denotes a scaling factor; both of these are considered hyper-parameters (i.e. defined with the network architecture and not updated during SGD). In each case, $x$ denotes a single output neuron (activation of each output is independent) except in the case of SoftMax, where each output $x_i$ is scaled by exponentiated versions of all outputs $x_j$ in the layer.
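Several of the activation functions described above can be sketched directly in NumPy. The default values of `alpha` are assumptions for illustration, since the text treats them as hyper-parameters without fixing them.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):          # alpha is an assumed hyper-parameter value
    return np.where(x < 0, alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, alpha=1.0):                  # alpha is an assumed hyper-parameter value
    return np.where(x > 0, x, alpha * np.exp(x) - alpha)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    # Unlike the others, softmax couples every output in the layer.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.linspace(-5.0, 5.0, 11)
for name, f in [("relu", relu), ("leaky_relu", leaky_relu), ("sigmoid", sigmoid),
                ("tanh", np.tanh), ("elu", elu), ("softplus", softplus),
                ("softmax", softmax)]:
    out = f(x)
    print(f"{name:10s} min {out.min():+.4f} max {out.max():+.4f}")
```

Only `relu` and `leaky_relu` avoid exponentials entirely, which is the computational advantage noted in the text.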

The perceptron description given in 2.9 and illustrated in figure 2.3 provides a simple, highly compact matrix multiplication operation followed by some activation which can generally be computed concurrently for each element in the matrix. This class of layer is typically referred to as a fully-connected (or Dense) layer, where the weight vector dimension is the product of the input and output dimensions. This is the most expressive layer, but it also contains the highest free-parameter count, making it both flexible and data-hungry: many examples are needed to fit all of the parameter values well.

Figure 2.3: A single fully connected neuron
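The fully-connected layer and its parameter count can be sketched as follows. This is a minimal illustration in plain Python; the function names and the inclusion of a bias term in the count are assumptions made for the example.

```python
def relu(v):
    """ReLU activation used for the example below."""
    return max(0.0, v)


def dense_forward(x, weights, bias, activation):
    """Compute activation(W x + b); weights[j][i] links input i to output j."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def dense_param_count(n_in, n_out, use_bias=True):
    """Weights are n_in * n_out; an optional bias adds one value per output."""
    return n_in * n_out + (n_out if use_bias else 0)


# Two inputs, two outputs: the second output is clipped to zero by the ReLU.
y = dense_forward([1.0, 2.0], [[1.0, 0.0], [-1.0, -1.0]], [0.5, 0.0], relu)
```

For example, a layer mapping 128 inputs to 64 outputs already carries 8192 weights plus 64 biases, which illustrates how quickly the free-parameter count grows.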

Figure 2.4: A simple 1D 2-long 2-filter convolutional layer

One solution for reducing the free-parameter count and introducing invariance properties which may be desired in certain layers is to use the convolutional layer [75, 1], which is commonly realized for 1D, 2D, 3D or higher dimensional input spaces. Here, the weight vector W is decomposed into a number of distinct filter channels as shown in figure 2.4, where each filter has some size smaller than the input dimension and is strided across the input vector, typically at some periodic interval. This has two enormous benefits. First, if the input is a translation invariant domain, such as a signal arriving at random time offsets or an image occurring at random X,Y translations, this forms a powerful regularization which learns the same features at all offsets within the input. Second, the number of free parameters is virtually always drastically reduced versus the equivalent fully connected layer, reducing the number of examples required to obtain similar accuracy, since fewer free parameters must be accurately fit.
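Both benefits can be seen in a small sketch of a strided 1D convolutional layer with two 2-tap filters. This is a plain-Python illustration for a single input channel with no padding; the function names and the example filter taps are hypothetical.

```python
def conv1d(x, filters, stride=1):
    """Slide each filter across x (no padding) and return one output per filter."""
    outputs = []
    for taps in filters:
        k = len(taps)
        outputs.append([
            sum(t * x[i + j] for j, t in enumerate(taps))
            for i in range(0, len(x) - k + 1, stride)
        ])
    return outputs


def conv1d_weight_count(filter_len, n_filters, in_channels=1):
    """Weights are shared across all offsets, so the count ignores input length."""
    return filter_len * in_channels * n_filters


x = [1.0, 2.0, 3.0, 4.0]
filters = [[1.0, 1.0], [1.0, -1.0]]  # hypothetical sum and difference filters

y = conv1d(x, filters)
# Delaying the input by one sample simply delays the learned features.
y_shifted = conv1d([0.0] + x, filters)
```

Here the two filters hold only 4 weights regardless of the input length, whereas a dense layer producing the same number of outputs would need one weight per input-output pair.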

Figure 2.5: A sequence of 2D convolutional layers from AlexNet [1]

Dilated convolutions [76] deserve a special mention within our discussion of radio time-series as well. Their recent use in neural networks [2] has been in the audio and voice processing domain, where dyadically scaled features of many temporal support widths contribute key features within both music and natural language. However, this property of helping (in multiple layer form) to represent exponentially different scalings of raw features is critical in the radio domain as well, where high sample rates are used and features may easily span 10x to 1000x or more in temporal support width.
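The benefit of dilation can be quantified through the receptive field of a stack of layers. The sketch below assumes stride-1 layers that all share one kernel size, in which case the receptive field is one plus the sum of (kernel size - 1) times each layer's dilation; the function name is illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of a stride-1 stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Ten 2-tap layers with dyadic dilations 1, 2, 4, ..., 512.
dyadic = receptive_field(2, [2 ** i for i in range(10)])

# The same ten layers without dilation.
plain = receptive_field(2, [1] * 10)
```

With dyadic dilations the temporal support doubles with every layer, while the undilated stack grows by only one sample per layer.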

Figure 2.6: An example dilated convolution structure from WaveNet [2]

Each of these constructs presumes a feed-forward model for information flow (i.e. each layer only depends on the outputs of preceding layers). Recurrent layers relax this assumption and allow for a 'memory' connection within a single layer. This is a powerful tool which has been demonstrated to be highly effective in temporal sequence modeling [77, 78], partially due to the fact that it can relax the simplifying Markov assumption which is made in the case of an HMM.

2.3.3 Regularization

One of the core problems with stochastic gradient descent based methods (and many other machine learning methods) is the propensity of the training process to overfit the model to training set data. To avoid overfitting, that is, aligning the model solution more closely with the specific training examples than with the general solution to the problem they represent, a number of solutions have been proposed and used over time. Simple forms of regularization may focus on the L1 or L2 norm of either activations or weight vectors, attempting to push unused or rarely used conditions to zero, or to reduce high magnitude overfitting to specific cases. Ridge regression attempts to strike an optimal balance between these factors.

Dropout introduces an entirely new form of regularization [1, 3], which embraces the combinatorially large number of neurons and paths through a network, and probabilistically zeros neuron outputs during the training process, effectively removing connections as shown in figure 2.7.

Figure 2.7: Dropout effect on network connectivity, from [3]

By doing this, networks cannot overly rely on any one specific neuron or network path for a single use case or example. Instead, training can be seen as fitting an exponentially large ensemble of all possible sub-graphs of neurons through the network, sampled at random, at an enormous computational saving over actually training that many separate independent graphs. The effect of Dropout is quite stark, as shown in the example in figure 2.8. Here, when training on a very small (500 example) subset of the canonical Modified National Institute of Standards and Technology (MNIST) dataset without dropout, training loss goes to near zero quickly, but the model overfits, with validation loss plateauing at a high level. With dropout, however, training and validation loss track much more closely, and overfitting occurs much later and to a much lesser degree, yielding much better generalization.
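The dropout mechanism described above can be sketched as follows. This is a minimal illustration using the so-called inverted dropout convention (kept activations are rescaled during training so that nothing needs to change at inference time), which is an assumption of this example and one of several equivalent conventions; the function name is hypothetical.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training only."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Rescale the survivors so the expected output equals the input.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 10000, 0.5, rng)
eval_out = dropout([1.0, 2.0], 0.5, rng, training=False)
```

Each call during training samples a different random sub-network, which is what gives rise to the ensemble interpretation discussed above.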

DropConnect [79] was introduced more recently, employing the same variety of probabilistic dropout on network paths during training, with slightly improved performance versus Dropout, but dropping out fine grained neuron inputs rather than outputs. Unfortunately, DropConnect requires an increase in computational complexity when computing ensemble outputs, and is not nearly as simple to implement as Dropout. Its adoption has not been as widespread as that of Dropout at this time.

Figure 2.8: Example Effect of Dropout on Training and Validation Loss

More recently, batch normalization [80] has begun to be adopted widely as another form of regularization (especially for convolutional layers) and functions surprisingly well. In batch normalization, inter-layer activations are normalized to zero mean and unit variance over each mini-batch during training, resulting in more stable covariance properties and providing a surprisingly good regularization effect. Currently this is one of the most widely used regularization methods for state of the art CNNs. Very recently, an approach has been devised [73] which employs carefully crafted network weight initializations and scaled exponential linear units (SELUs) in order to guarantee the same inter-layer activation properties (normalization) without explicitly having to scale them. This can result in significantly faster convergence and lower computational complexity in some cases.
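The training-time normalization step of batch normalization can be sketched for a single feature as follows. This minimal example omits the running statistics used at inference time; the learned scale gamma, shift beta, and the small constant eps are standard parts of the technique, but the names and default values here are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    # eps guards against division by zero for constant batches.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```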

2.3.4 Architectural Strategies

There are a number of high level architecture design strategies which have played important roles in deep neural network design over the past few years. Beyond basic layer design, higher level connectivity design is important in shaping the flow of information, combining features from different regions within larger networks, and achieving the right structure with a limited number of free parameters. Early attempts at providing paths through the network to combine low level inputs and features with higher level features included the use of highway networks [81], which showed improvements in some cases.

More recently, however, residual networks (ResNets) [4] have become widely adopted within computer vision due to their ability to fit many features of varying scale, to leverage depth effectively, and to avoid heavily overfitting to training sets. They are typically used with batch normalization for regularization; a single 'residual unit' is shown in figure 2.9.

Figure 2.9: A single residual network unit, from [4]

Many of these units can be stacked into a 'residual stack' to form a network where features may easily pass through many layers of embedding, or may bypass embeddings, and which may fit optimal sets of features mixing both types at many layers of abstraction. This is an important breakthrough in multi-scale learning, and one that generally represents the state of the art today in computer vision architectures.
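The bypass behaviour of a residual unit, y = x + F(x), and of a stack of such units can be sketched as below. The learned mapping F is represented by an arbitrary callable here, which is a simplification for illustration; in a real network it would be a small group of convolutional or dense layers.

```python
def residual_unit(x, transform):
    """Return x + F(x), where F must preserve the vector length."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input dimension")
    return [xi + fi for xi, fi in zip(x, fx)]


def residual_stack(x, transforms):
    """Apply a sequence of residual units one after another."""
    for transform in transforms:
        x = residual_unit(x, transform)
    return x


def zero(v):
    """A unit whose learned mapping is zero acts as a pure bypass."""
    return [0.0] * len(v)


def double(v):
    """A hypothetical non-trivial mapping used for the example."""
    return [2.0 * a for a in v]


bypassed = residual_stack([1.0, -2.0], [zero, zero, zero])
mixed = residual_unit([1.0, 2.0], double)
```

When a unit's mapping is driven to zero the input passes through unchanged, which is how features can skip layers of embedding in a deep stack.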

Figure 2.10: An exemplary residual network stack, from [4]

Attention or saliency is another key high level architectural design consideration in many networks. Many networks have a hard time scaling to very large input sizes, so for tasks such as the Google Street View challenge, which must consume very high resolution imagery and discriminate house number digits, some method for directing attention to the digits before discriminating can drastically reduce network complexity by introducing domain appropriate transforms. In the case of vision, the 2D affine transform works very well at resolving scale, translation, skew, and rotation in input patches. Figure 2.11 illustrates the spatial transformer network (STN) architecture, where a localization network estimates some set of parameters θ which work with a transformer to produce a canonical image, which can then be classified using a relatively simple discriminative network.
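The role of the parameters θ can be illustrated with a sketch of the affine sampling grid. The example assumes θ is a 2x3 matrix applied to normalized output coordinates in [-1, 1], and it omits the interpolation step that would read pixel values at the resulting source coordinates; the function name is illustrative.

```python
def affine_grid(theta, height, width):
    """Map each output pixel's normalized (x, y) to a source coordinate."""
    def coord(i, size):
        # Spread indices 0..size-1 evenly over [-1, 1].
        return -1.0 + 2.0 * i / (size - 1) if size > 1 else 0.0

    grid = []
    for r in range(height):
        y = coord(r, height)
        row = []
        for c in range(width):
            x = coord(c, width)
            row.append((
                theta[0][0] * x + theta[0][1] * y + theta[0][2],
                theta[1][0] * x + theta[1][1] * y + theta[1][2],
            ))
        grid.append(row)
    return grid


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]   # sample the central half
shift = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]     # translate along x

g_identity = affine_grid(identity, 3, 3)
g_zoom = affine_grid(zoom_in, 3, 3)
g_shift = affine_grid(shift, 3, 3)
```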

Many of these architectures were developed for computer vision or for voice; however, the high level concepts outlined here are at least as applicable in the radio domain, where high dimensional search spaces may include time, frequency, spatial, polarization or other search spaces with well understood transforms, as discussed later.

Figure 2.11: Spatial transformer network structure, from [5]

2.3.5 High Performance Computing

Usable computational capacity through high performance computing and powerful models for expressing algorithms has been a core enabler of deep learning. Since Gordon Moore's famous statement [82] that the number of components/transistors on an integrated circuit appeared to double every year (later adjusted to roughly every two years, and often quoted as every 18 months), we have seen one of the most incredible technological scaling processes in history, driving the growth of computing and computing related industries. Unfortunately, over the past 10 years we have begun to run into limitations on translating this transistor count into growth in useful computation. The cause for this is best illustrated by the plots shown in figure 2.12. Transistor counts continue to scale, but clock speed and single threaded performance have largely plateaued and no longer see the same exponential gains each year. This has led to growth in the number of cores per processor at a rate almost equal to that of the number of transistors on chip.

In the past 10+ years, computing has attempted to embrace this many-core future by introducing numerous processing architectures with multi-core or many-core structures, along with many unique programming models to exploit them. While many hardware architectures have achieved theoretical peak performance numbers which continue to ride Moore's law, some of them, such as the Cell Broadband Engine (CBE) [83], the Tile processor [84] and others, did so while placing the vast majority of the burden on the software programmer to effectively balance algorithm distribution, data movement, thread communication, etc. between many cores. Unfortunately, this led to highly limited adoption of such architectures, since significant software development and tuning of algorithms for specific architectures was required in order to obtain near-theoretical performance. Around this time, we investigated efficient high throughput software radio on the CBE and obtained limited success [85], but ultimately faced very long development times and a discontinued processor roadmap from IBM.

Figure 2.12: Single threading ceiling illustrated, from [6]

At the same time, GPUs were rapidly expanding to meet the needs of the wide dense matrix algebra operations required for high rate and high resolution rendering of games and movies using OpenGL. To meet the needs of these rendering algorithms, graphics cards generally turned to many-core solutions where operations could leverage wide architectures, busses and concurrent processing at power-efficient clock speeds and very high floating point throughput rates.

Around 2007, the notion of general purpose graphics processing unit (GPGPU) computing began to come to the forefront. Nvidia released their Compute Unified Device Architecture (CUDA) [86] software development kit, ATI released their Close-to-the-Metal (CTM) SDK [87], and shortly thereafter OpenCL [88] emerged as an attempt at a mainstream cross-vendor GPGPU programming solution.

CUDA, CTM (now discontinued), and OpenCL have all been used widely in specific applications and generally employ a more programmer friendly architecture than was possible with Tile or CBE. However, their use in radio signal processing has been largely limited to high computation kernels which are ported to them and tuned. Wideband channelization [89] has seen widespread success in this space, along with a variety of other kernels [90, 91]. In general these attempts have continued to be plagued by the problem of balancing I/O and compute distribution among compute elements in a general way across a heterogeneous set of algorithms.

In 2010, Theano [92] introduced a substantially new model, which relied on high level Numpy-like [93] matrix algebra definition in Python together with efficient data-flow computation graph partitioning, GPU compilation, and mapping and optimization over distributed GPU and CPU compute elements. This was a huge step in that it made the programming model for concurrent architectures much more rapid and accessible without significant investment in custom CUDA code, and maintained portability across different CPU and GPU backends. Google followed thereafter with the release of TensorFlow [94], which ultimately improved upon and displaced Theano (a university project) with a fully supported commercial open source project. While AlexNet [1] used CUDA implementations of its convolutional neural network directly, it was very shortly thereafter that Theano and similar frameworks began to be heavily leveraged for rapid model iteration and neural network prototyping, combining a high level programming language with highly efficient concurrent GPGPU compute architectures for rapid training.

Theano [92] and TensorFlow [94] in this sense really pioneered an entirely new class of computing, based on the functional programming [95] style definition of very large matrix algebra computation graphs. This capability has so far been heavily leveraged by the machine learning and neural network community in libraries such as Keras [96], which express entire networks as large tensor graphs and efficiently place them onto multi-CPU or multi-GPU architectures for rapid training and inference. However, the applicability of these models is actually far wider than machine learning alone, with countless signal processing applications standing to benefit from large functional graph composition, partitioning, kernel synthesis, optimization layout, and orchestration onto large distributed compute architectures. Within the past few years, the growth of high performance computing frameworks centered around deep learning and leveraging these core ideas has been astounding: Caffe [97], Chainer [98], Torch [99], PyTorch [100], MXNet [101], Lasagne [102], and many other frameworks have explored various enhancements and syntaxes for such high level deep learning models.

Figure 2.13: Concurrent GPU vs CPU compute architecture scaling (2017), from [7]

In recent years, the spread between concurrent architectures able to continue to grow and leverage Moore's law, and those that are more limited in their ability to scale to wide architectures, has widened greatly, as illustrated in figure 2.13. At this point, virtually every compute architecture is following suit and providing very high throughput, wide tensor operations which scale very well with neural network primitives. Not all algorithms scale well on such architectures (for example, those with tight sequential single loop dependencies), but the class represented by most wide and deep neural networks maps incredibly well and efficiently onto such wide architectures, where they can be partitioned readily and automatically for both pipeline and data parallelism from large functional data-flow graph definitions. This synergy between concurrent models and compute architectures is one of the key enablers for the adoption of deep learning models, which offer highly efficient realizations versus algorithms which rely on more iterative or tightly looped designs. As a result, any algorithm or approximation fit to such a network is likely to map well onto distributed, scalable real-world compute architectures and to work within the limitations imposed on us by device physics.

2.3.6 Model Search

Since the original AlexNet paper [1] there have been numerous improvements in image recognition architectures. Some of these have been due to significant algorithm enhancements and others have been due to simple architectural and hyper-parameter adjustments in the architectural elements or training procedures. This general problem of how best to find an architecture for some learning problem, especially for new problems which have not been as heavily explored as vision, is still an open one. There have been a number of attempts to explore this problem of architecture search or hyper-parameter search which have yielded significant steps forward, but tools to address and solve these problems are not yet widely disseminated, and this remains a major need among many practitioners.

Approaches which have been explored recently to solve this problem include using gradient descent on the hyper-parameters (so called hyper-gradient descent) [103, 104], as well as reinforcement learning driven search processes [105] and evolutionary methods [8, 106]. Evolutionary methods currently seem to show some of the most robust results. Figure 2.14 illustrates the performance of one such evolutionary search for convolutional network models to solve the CIFAR-10 and CIFAR-100 dataset image classification tasks [107].

Figure 2.14: Evolutionary performance of image classifier search, from [8]

Unfortunately, the computational resources needed today for such very large NN model evolutionary searches are quite high. As a result, we introduce a simpler small scale evolutionary strategy later in this work.
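To make the general idea of evolutionary hyper-parameter search concrete, a generic toy sketch is given below. It is not the strategy introduced later in this work nor the method of [8]: the selection rule (keep the better half, mutate survivors), the hyper-parameters (layer count and width), and the fitness function, a stand-in for a validation score, are all hypothetical.

```python
import random


def evolve(fitness, init, mutate, generations, population, rng):
    """Keep the fitter half each generation and refill by mutating survivors."""
    pop = [init(rng) for _ in range(population)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))  # best score so far
        survivors = pop[: max(1, population // 2)]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness), history


def toy_fitness(cfg):
    """Hypothetical stand-in for validation accuracy; peaks at (4, 64)."""
    layers, width = cfg
    return -abs(layers - 4) - abs(width - 64) / 16.0


def toy_init(rng):
    return (rng.randint(1, 8), rng.choice([16, 32, 64, 128, 256]))


def toy_mutate(cfg, rng):
    layers, width = cfg
    if rng.random() < 0.5:
        layers = min(8, max(1, layers + rng.choice([-1, 1])))
    else:
        width = rng.choice([16, 32, 64, 128, 256])
    return (layers, width)


best, history = evolve(toy_fitness, toy_init, toy_mutate,
                       generations=20, population=10, rng=random.Random(1))
```

In practice each fitness evaluation requires training a candidate network, which is where the very high computational cost of such searches arises.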

2.3.7 Model Introspection

One of the largest critiques of deep learning today is that it can be seen as a "black box" method, in which inputs and output tasks are optimized, but there is little visibility into what is going on inside the model. While there is some truth to this accusation, it is also a bit unfair to say a trained neural network is a black box. Aside from the basic intuition about the capabilities of specific layers, there are a number of techniques which can be employed to visualize and measure the effects of what is going on within each layer.

Figure 2.15: Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]

For low level weights, direct inspection of weight vectors can be informative. The layer 1 CNN weights shown in figure 2.15 can provide some intuition as to what each filter represents. Various rotations and configurations of small low level patterns actually wind up quite close in some cases to the Gabor filters which were previously used as an expert low level feature extractor. However, at higher layers in a CNN architecture, the direct meaning of a set of filter weights is not so immediately clear from direct weight inspection.

Figure 2.16: Filter activation visualization in CNNs, from [9]

A popular technique for understanding the meaning of high level CNN features is to look at the activations of different features at different layers in response to known image stimuli, as explored in [9]. Certain classes of objects which are known to stimulate class labels can be seen to activate a number of intermediate feature maps within the image. Example top-9 activation maps are shown in figure 2.16 for a handful of high level features. Here, activations can often be seen by observation to be correlated directly with component features of high level classes. For instance, specific facial features may produce activations at one layer and combine to form a full face activation at a higher level, as has been demonstrated.

In classification tasks, it is possible to perform gradient descent on the input, starting from a random image, to find an image which maximally activates some class label. This method was first performed in [108] and then improved in [10] by including a regularization term (requiring a relatively smooth input). By doing this, inputs can be generated which demonstrate what low level features drive any given activation within a network. Figure 2.17 shows this technique used on imagery, clearly illustrating some of the visual features of each class which have been captured by the high level class specific feature map activation.
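The optimization of an input to maximize an activation can be sketched on a toy model. The example below assumes a linear class score w·x and substitutes a simple L2 penalty for the smoothness regularizer of [10]; both are simplifications chosen so that the optimum, x = w / (2·l2), is known in closed form. The function name and step sizes are illustrative.

```python
def maximize_activation(weights, steps=200, lr=0.1, l2=0.5):
    """Gradient ascent on score(x) = w.x - l2 * ||x||^2, starting from zero."""
    x = [0.0] * len(weights)
    for _ in range(steps):
        # d/dx of the regularized score is w - 2 * l2 * x.
        grad = [w - 2.0 * l2 * xi for w, xi in zip(weights, x)]
        x = [xi + lr * g for xi, g in zip(x, grad)]
    return x


w = [1.0, -2.0, 0.5]          # hypothetical class weights
x_star = maximize_activation(w)
```

The recovered input aligns with the weights of the chosen class, which is the toy analogue of the class specific visual features that appear when this procedure is run on an image classifier.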

Figure 2.17: Optimization of input images for feature activation, from [10]

Other methods for introspection focus on localizing where in the input vector the contributions to a feature's activation occur (a so called saliency map). This can be done in several ways, but one of the most promising recent methods involves differentiating a feature's activation output with respect to pixels or points in the input image. This method, the gradient class activation map (GradCAM) [11], is a powerful way of localizing and highlighting which regions in an input correspond to which activations. Figure 2.18 illustrates this technique on dog and cat classes within a single image, for a classifier trained on image labels without any location information.

Figure 2.18: GradCAM Saliency Maps for Dogs and Cats, from [11]

From an information theoretic point of view, newer work [12] examines the behavior of each layer of a neural network by measuring the joint information between input, output, and intermediate layers throughout the deep learning training process.

In figure 2.19, we illustrate the information plane, which relates the joint information between raw input (X-axis) and output/targets (Y-axis) to the information contained at each layer of the model during training. Interestingly, we can see that as training progresses, the layers move to represent more information about the input, while continually representing more joint information with the output, and then finally enter a compression stage where they filter and remove information about the input X while preserving information about the targets, Y. This is an interesting viewpoint for understanding information flow and compression through a so called 'bottleneck' during DL model training. On the right we see the mean and variance of the gradients used to guide the gradient descent, which start with a high SNR (large mean and low variance) and gradually decrease in SNR throughout training until they no longer possess significant meaningful gradient information to further guide the solution. Such an information centric view is quite important when considering deep learning for numerous communications and signal processing tasks, where preservation or compression of information throughout the network is often desired, and a solid understanding of how information is preserved or compressed can be helpful.

Figure 2.19: Information theoretic visualization of deep learning, from [12]

Chapter 3

Learning to Communicate

Since virtually the beginning of radio, radio transceivers and waveforms have been conceived through human design. Original electromagnetic (EM) communications systems such as the telegraph and the spark gap transmitter [109] were practical due to hardware and EM understanding at the time.

Physical layer designs grew increasingly complex as multiple access schemes such as frequency-division were introduced to allow additional users, higher data rates, increased device power efficiency and decreased cost. In 1948 Shannon introduced information theory and the notion of optimal channel capacity to the world, defining the fundamental problem of communication as "reproducing at one point either exactly or approximately a message selected at another point" [20]. This placed a theoretical upper bound on the capabilities of single antenna transceivers over a Gaussian channel, but it did not inform radio designers specifically how to attain those levels of performance.

Figure 3.1: Illustration of the many modular algorithms present in a modern wireless physical layer modem such as LTE

Since then, radio engineers have iterated through numerous modulation, coding, and radio design approaches every few years in an attempt to improve capacity, reduce cost and power requirements, and generally push our devices closer to these capacity bounds.

In today's world, modern modems look something like that shown in figure 3.1, which depicts the many modular algorithms in the physical layer of a modern wireless system such as LTE. Here, each module represents one of numerous intense areas of research surrounding optimal coding, MIMO precoding, subframe allocation, modulation, and other tasks, which are all composed sequentially and distinctly to form the powerful and efficient standards we use today.

Within each of these modules typically lies some analytic formulation of the wireless channel. In the case of error correction codes, random bit flips may be used when testing or validating a code, and for modulations or MIMO coding schemes, Gaussian noise or Rayleigh fading channels are frequently used to model the propagation channel. In each of these cases, such an approach generally requires simplifying assumptions and modular optimization of individual algorithmic components rather than optimization of the system as a whole.

This has proven to be effective, but generally leaves open several questions. Can we do better with richer information about the real distributions of actual impairments in a specific deployment scenario? Can we do better if we jointly optimize the system rather than building components with rigid interfaces and intermediate values? Can we find a more straightforward way to build complex communications systems which attain similar performance without the need for thousands of man-hours in engineering, software implementation and optimization time? And can we find such systems which maintain near-Shannon levels of performance while maintaining flexibility to adapt the physical layer more fully than this sort of rigid physical layer algorithm definition will allow?

3.1 The Channel Autoencoder

To answer these questions, we consider again the fundamental task of a radio communications system: "reproducing at one point either exactly or approximately a message selected at another point" [20]. This task is strikingly similar to that of an autoencoder, whose objective is to reconstruct some input vector $x$ at the output $\hat{x}$ and minimize the loss between the two by learning an encoder and a decoder for that input. We first introduce this idea in [110] and further refine it in [111].

Figure 3.2: The Fundamental Communications Learning Problem

Figure 3.3: A simple autoencoder for a 2D MNIST image, from [14]

Traditionally an autoencoder is used to learn a lower dimensionality sparse representation of the input vector x (such as the MNIST digits shown in figure 3.3), which may be non-linear when using non-linear neural network activation functions. This approach for learning encoding, decoding, and sparse representations has several benefits: it can be fit non-linearly to the distribution of a given input dataset, it can be tuned for a specific loss function (e.g. MSE, binary cross-entropy (BCE), or categorical cross-entropy (CCE)), and it can act as a filter to remove non-structural noise which does not lie within the learned support of the compressed representation.

Figure 3.4: A Simple Channel Autoencoder

We can formulate the radio communications system problem as a similar autoencoder, where a message to transmit s, either a k-bit binary vector with $M = 2^k$ possible codewords or an equivalent one-hot codeword vector of length M, is encoded, passed through some set of channel impairments, and then decoded to recover $\hat{s}$, an estimate of the originally transmitted message. The channel layer may be stochastic in nature; stochastic layers of this kind have in fact been used regularly within computer vision for their regularizing properties (e.g. [3], [112]).

This channel autoencoder differs from the conventional use of an autoencoder in two ways. First, the intermediate representation of the signal may actually be of higher dimension (as opposed to most autoencoders, which seek a sparse representation). Second, the channel layer introduces numerous lossy and mixing impairments rarely seen in other configurations (e.g. noise, fading, rotation, etc.). We consider s to be a number of bits k, producing $2^k = M$ distinct messages, which are encoded into some number, n, of real or complex valued digital samples. Controlling the ratio k/n (the configuration is further referred to as (n, k)) for a given sample rate and signal and noise power controls the information rate at which bits are transmitted over the channel. By modifying these dimensions, any rational rate system can be obtained using the same approach for arbitrary values of k and n, or simply M and n.
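The relationship between k, M, n and the rate can be sketched as follows. The helper names are illustrative, and the rate is expressed in bits per channel use as an exact fraction to emphasize that any rational rate can be formed.

```python
from fractions import Fraction


def num_messages(k):
    """Number of distinct messages M = 2^k for a k-bit block."""
    return 2 ** k


def one_hot(index, size):
    """One-hot codeword vector of length M selecting message `index`."""
    return [1 if i == index else 0 for i in range(size)]


def code_rate(n, k):
    """Bits per channel use for an (n, k) system."""
    return Fraction(k, n)


M = num_messages(4)            # 4-bit messages
s = one_hot(2, num_messages(2))
rate_7_4 = code_rate(7, 4)     # the (7,4) configuration
rate_8_8 = code_rate(8, 8)     # the (8,8) configuration
```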

We construct the autoencoder as a relatively small network, shown in table 3.1, whose dimensions scale with M and n. Interestingly, while a single fully connected linear layer in the encoder is fully capable of mapping all codewords to all possible real valued transmit symbols in one step, SGD cannot find a good solution when using only a single layer, and gets stuck in a sub-optimal local minimum during training. Adding a second layer of depth to the transmit and receive networks, however, allows the network to converge very rapidly to a very good, globally optimal set of network weights. This is an excellent illustration of the work in [37], which demonstrates that a deeper network with a higher dimensional parametric search space actually helps networks converge to more globally optimal solutions: they are much less likely to become trapped in a local minimum, simply because the curvature is unlikely to align across all degrees of freedom at once. In this deeper, higher dimensional space they are instead more likely to encounter a saddle point, which is not necessarily terminal in a gradient descent search when using a strong saddle-free optimization method (some, such as Newton’s method, may have difficulty).

Table 3.1: Layout of the autoencoder used in Figs. 3.5 and 3.6. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.

Layer             Output dimensions
Input             M
Dense + ReLU      M
Dense + linear    n
Normalization     n
Noise             n
Dense + ReLU      M
Dense + softmax   M
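The parameter counts quoted in the caption of table 3.1 can be checked with a short sketch. It assumes that each dense layer carries a weight matrix plus one bias per output and that the normalization and noise layers have no trainable parameters; the function name is hypothetical and is not taken from the original implementation.

```python
def autoencoder_param_count(M: int, n: int) -> int:
    """Count weights and biases of the four dense layers listed in table 3.1."""
    # (inputs, outputs) of each dense layer: two in the encoder, two in the decoder.
    dense_layers = [(M, M), (M, n), (n, M), (M, M)]
    # Each dense layer has inputs * outputs weights plus one bias per output.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in dense_layers)


for n, k in [(2, 2), (7, 4), (8, 8)]:
    M = 2 ** k  # number of distinct messages for k bits
    total = autoencoder_param_count(M, n)
    # The layer-by-layer sum agrees with the closed form given in the caption.
    assert total == (2 * M + 1) * (M + n) + 2 * M
    print((n, k), total)
```

The layer-by-layer sum expands to 2M^2 + 2Mn + 3M + n, which is the same polynomial as the closed form (2M + 1)(M + n) + 2M.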

In order to avoid the trivial solution of using very large values for x in the symbol encoding to increase the effective SNR over a constant channel noise power, we introduce a transmit normalization layer after the encoder which enforces a constant average power for transmitted symbols during training, as indicated in figure 3.1. This can be done on a per-symbol or per-batch level, and can be enforced in a number of ways, including a mean amplitude, mean power, max power, or other similar constraint, in some cases yielding quite different results for each.

In figure 3.5 from [111] we compare the performance of a learned physical layer encoding for block sizes of 2 and 8 bits against the block/codeword error rate performance of uncoded binary phase shift keying (BPSK) modulation. In this case, we have the interesting result that, for a 2-bit codeword size, 2xBPSK and the (2,2) autoencoder obtain the same information rate (by definition), and align on an almost identical error rate curve. As we increase the block size to 8 bits, we begin to see the (8,8) autoencoder system outperform the uncoded 8xBPSK system by 1-2 dB at higher SNR values. This indicates that the larger block size (8,8) autoencoder is in fact learning some form of error correction, where its encoding scheme is more robust than the simple BPSK solution.
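For reference, the uncoded BPSK baseline can be approximated analytically. The sketch below assumes coherent BPSK over an AWGN channel with independent bit errors, and uses the textbook bit error probability Q(sqrt(2 Eb/N0)); it is not the simulation used to produce figure 3.5, and the function names are hypothetical.

```python
import math


def bpsk_bit_error_rate(ebn0_db: float) -> float:
    """Textbook BPSK bit error probability over AWGN, Q(sqrt(2 Eb/N0))."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)  # convert dB to a linear ratio
    # Q(x) = 0.5 * erfc(x / sqrt(2)), so Q(sqrt(2 * ebn0)) = 0.5 * erfc(sqrt(ebn0)).
    return 0.5 * math.erfc(math.sqrt(ebn0))


def uncoded_bler(ebn0_db: float, k: int) -> float:
    """Block error rate of k uncoded bits: a block fails if any bit is wrong."""
    p_bit = bpsk_bit_error_rate(ebn0_db)
    return 1.0 - (1.0 - p_bit) ** k


for k in (2, 8):
    print(k, [uncoded_bler(snr_db, k) for snr_db in (0, 4, 8)])
```

Under these assumptions the block error rate grows with k at a fixed Eb/N0, which is why a learned (8,8) code has room to improve on 8xBPSK while the (2,2) case has little to gain.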

Figure 3.5: BLER versus Eb/N0 for autoencoder

[Plot: block error rate versus Eb/N0 [dB]; curves: Uncoded BPSK (8,8), Autoencoder (8,8), Uncoded BPSK (2,2), Autoencoder (2,2).]

In figure 3.6 we compare an autoencoder with 4-bit codewords and 7 real valued symbols over the channel against three baselines: first the uncoded (4,4) BPSK solution, which provides the worst performance, and then two baselines using a Hamming code with the same 4/7 rate as the autoencoder. In the case of the hard decision decoder, there is still a 1-2 dB gap in performance, while for MLD decoding the performance is nearly identical. This is a very promising result, as it shows that for small block sizes the channel autoencoder approach can learn very strong solutions which rival commonly used modulation and error correction codes.

Figure 3.6: BLER versus Eb/N0 for autoencoder

[Plot: block error rate versus Eb/N0 [dB]; curves: Uncoded BPSK (4,4), Hamming (7,4) Hard Decision, Autoencoder (7,4), Hamming (7,4) MLD.]

To further understand the solutions learned by this naive autoencoder learning process, we can plot the constellations of each learned encoding scheme simply from their input to the channel module. Figure 3.7 illustrates the constellations learned for (2,2), (2,4), (2,4), and (7,4) schemes, where different power normalization constraints on (2,4) produce different constellations (e.g. 16-PSK or non-standard 16-QAM), and the 7-dimensional encoding space of the (7,4) code is visualized in 2 dimensions using t-SNE [113]. It is pleasing here that the canonical QPSK solution (with random rotation) is achieved for the (2,2) code, and that the familiar PSK as well as a non-rectangular, near-optimally packed 16-QAM is achieved for (2,4).

The training process for channel autoencoders is an interesting problem in which the model must learn to perform well in both low and high SNR conditions, and in which the channel and training parameters may be manipulated during training. Experimentally, we find that training at a mid-range SNR (8 dB Eb/N0) works well, and that varying the batch size from small (50) to large (10,000) in two passes trains the system effectively. This is an interesting result, as the batch size has an effect on the effective SNR of the gradients and the average receive symbol locations. In general in computer vision, high SNR images are used, which may have occlusions, permutations, or small objects, but generally do not have white noise competing with the 'signal power' of an actual visual object. However, in vision there has also been discussion recently surrounding the use of increasing batch sizes, rather than a decreasing learning rate, throughout training as smaller (and more noisy) step sizes are needed during optimization.

Figure 3.7: Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-SNE embedding of received symbols.

The choice of transmit normalization is an interesting one which has no clear 'best choice', but has a significant effect on the learned solution. We consider a number of normalization functions which map the output of the encoder $X_{i,j,k}$ to the input to the channel $X_t$, as $X_t = f_{norm}(X_{enc})$. Here, $X_{i,j,k}$ represents a 3 dimensional tensor for one training iteration, over i the example index, j the sample index within one example, and k the complex sample component index (i.e. I and Q). Table 3.2 provides several possible transmit normalization functions $f_{norm}$ which can be used, where $N_x$ is the number of examples, $N_s$ is the number of samples, and $N_c$ is the number of components per sample (2).

Table 3.2: Candidate channel autoencoder transmit normalization functions

Tx Norm Method                 Expression
Example Mean Power (EMP)       $X_t = X (N_x N_s) / \sum_{i,j} \sqrt{\sum_k X_{i,j,k}^2}$
Batch Mean Power (BMP)         $X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} X_{i,j,k}^2}$
Batch Mean Ampl. (BMA)         $X_t = X (N_x N_s N_c) / \sum_{i,j,k} \mathrm{abs}(X_{i,j,k})$
Batch Mean Max Power (BMMP)    $X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} \max(X_{i,j,k}^2, 1)}$
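Two of these normalization choices can be sketched in plain Python, with nested lists indexed as [example][sample][component] standing in for the tensor X. The first function follows the batch mean amplitude (BMA) expression directly. The second is only one reading of a per-example constraint, based on the description that each symbol takes on an average power of 1, and is an assumption rather than a transcription of the EMP expression. Both function names are hypothetical, and a non-zero batch is assumed.

```python
import math


def batch_mean_amplitude(batch):
    """BMA: scale the whole batch so the mean absolute component value is 1."""
    values = [c for example in batch for sample in example for c in sample]
    # len(values) equals Nx * Ns * Nc, the numerator of the BMA expression.
    scale = len(values) / sum(abs(v) for v in values)
    return [[[c * scale for c in sample] for sample in example] for example in batch]


def per_example_unit_power(batch):
    """Assumed per-example variant: each example gets mean power 1 per sample."""
    normalized = []
    for example in batch:
        # Power of one sample is I^2 + Q^2; average it over the samples.
        power = sum(c * c for sample in example for c in sample) / len(example)
        scale = 1.0 / math.sqrt(power)
        normalized.append([[c * scale for c in sample] for sample in example])
    return normalized


batch = [[[3.0, 4.0]], [[0.0, 1.0]]]  # two examples, one I/Q sample each
print(batch_mean_amplitude(batch))
print(per_example_unit_power(batch))
```

The batch-level version lets individual symbols differ in power as long as the batch average is fixed, while the per-example version forces every symbol onto the same power.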

To gather an intuition for the solutions learned for this class of constellation in a traditional 2D (I/Q) single symbol space, we can compute and plot the learned constellations for 2-QAM through 33-QAM for each normalization strategy below. Interestingly, since we can map to any number of codewords trivially with this approach, we don’t need an integer number of bits to transmit, only an integer number of codewords, leading to numerous rate adaptation possibilities beyond the traditional $2^N$ constellations used today for QAM.

First, in figure 3.8 we show the result of using the symbol power constraint per example (EMP). In this case each symbol takes on an average power of 1, leading to conventional constant modulus solutions of phase-shift keying (PSK).

Figure 3.8: Learned QAM Modes for Example Mean Power (EMP)

Figure 3.9: Learned QAM Modes for Batch Mean Power (BMP)

In figure 3.9 we use the average symbol power over an entire batch, which frees each individual symbol up to vary to some degree as long as the mean is constrained. Here, we begin to see multi-level constellations form which are quite interesting and non-conventional. However, one interesting case here is that of 5-QAM, where it has learned a relatively constant power constellation which differs from BMA.

Figure 3.10: Learned QAM Modes for Batch Mean Amplitude (BMA)

Figure 3.10 shows the batch mean amplitude mode, where again we obtain a number of novel solutions, such as for 5-QAM, where we obtain a QPSK-like constellation which also uses the zero-power mode as a 5th constellation point.

Numerous additional constraints are possible. In figure 3.11 we use a constraint which limits mean power per batch, but considers $\max(X_{i,j,k}^2, 1)$ before averaging to avoid overly incentivizing low-power constellation points (e.g. all points under the average power are of equal penalty). Such low-power points might, for example, lead to results with very poor peak to average power ratio (PAPR), and hence poor amplifier efficiency.

Figure 3.11: Learned QAM Modes for Batch Mean Max Power (BMMP)

These results are of course only for a single symbol. By scaling the values of n and k, we can design a system which encodes an arbitrary number of bits into an arbitrary number of symbols. When encoding across multiple symbols, a typical solution appears to be a unique non-standard $2^k$-QAM constellation for each symbol, and then some kind of trellis-like combining across multiple symbols to obtain good coding gain. Examples of this are shown in figure 3.12, where we encode 2-bit, 4-bit, and 8-bit messages into groups of 4 sequential symbols. In this case, additional error correction capacity is obtained versus the single symbol form, and a distinct QAM arrangement is learned for each symbol with a highly non-intuitive arrangement. Here one codeword corresponds to a point in each of the four spaces. While two constellation points may be close together in one symbol, the points corresponding to the same two messages will be far apart in another symbol time, allowing non-linear combining to perform an efficient representation and decoding over all the dimensions.

Figure 3.12: Learned 4-Symbol QAM Modes using BMA for 2-bit, 4-bit, and 8-bit messages

This method works surprisingly well, but one of the key challenges with it is scaling to much larger codeword sizes such as the 1000+ bits used in modern turbo codes. When using $L_{CCE}$ we must select among $2^k$ codeword indices (messages), scaling our network exponentially as bits are added. One solution is to use k binary inputs and k sigmoid binary bit outputs along with an $L_{BCE}$ loss function. In this case, the network scales more linearly with block size; however, we have not yet been able to obtain near optimal capacity performance from a network trained in such a fashion. Other strategies for scaling to larger networks have been explored very recently within the scope of error correction codes [114, 115, 116] through methods involving partitioning and leveraging belief propagation graphs to seed neural network weights; however, significant work remains to allow these techniques to scale to large codeword sizes, such as are widely used today in modern LTE systems. Ultimately, methods such as replicating network structure within the full block size, whether through weight/connection tying or through some form of recurrent operation with state, hold significant promise for solving this problem in the future and allowing these methods to be competitive with state of the art error correction and modulation schemes.
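The difference in scaling between the two input representations can be illustrated with a small sketch. The helper names are hypothetical, and the bit ordering (most significant bit first) is an arbitrary choice made for illustration.

```python
def one_hot(message: int, k: int) -> list:
    """Message index as a one-hot vector of length 2**k (categorical form)."""
    vector = [0.0] * (2 ** k)
    vector[message] = 1.0
    return vector


def to_bits(message: int, k: int) -> list:
    """Message index as k binary values, most significant bit first."""
    return [float((message >> i) & 1) for i in reversed(range(k))]


for k in (2, 4, 8):
    message = 2 ** k - 1  # last message index for this block size
    print(k, len(one_hot(message, k)), len(to_bits(message, k)))
```

The one-hot width doubles with every added bit, while the binary width grows by one, which is the exponential versus linear scaling described above.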

3.2 Learning to Synchronize with Attention

When learning to discriminate between received symbols in a channel autoencoder (or between classes in a classifier), the discriminative model must generally learn to classify all forms of signal variation which may arrive at the receiver. In radio, permutations due to the channel include additive noise, phase offset, frequency offset, delay spread, interference, and many other distortions such as hardware non-linearities and mixer inter-modulation products. Previous results were shown only with AWGN impairments; however, real world systems include all of these effects and more.

Figure 3.13: Spatial Transformer Example on MNIST Digit from [5]

In computer vision, objects undergo a somewhat analogous set of permutations when being viewed, including scaling, rotation, skew, translation, occlusion, and noise. Since these permutations are geometrically well understood, a domain appropriate parametric transformation such as the 2D affine transform may be applied to correct them directly, as shown in figure 3.13 from [5]. By imparting expert knowledge about the domain appropriate parametric transforms, the task of canonicalizing an object may be reduced to estimating a set of parameters and then executing the transform. By splitting a classification task up into learned parameter estimation (localization), parametric transformation, and learned class discrimination, the model complexity needed to classify a range of permutations on the classes may be greatly reduced. If the parametric transform is implemented in a way which maintains its differentiability, both the localization and discrimination networks may be trained in an end-to-end fashion as a single task (e.g. minimize CCE) by using back-propagation from the global loss function both before and after the transform. This architecture has proven to be very effective for image classification, such as in the Google Street View house number challenge, where the localization network helps locate and canonicalize digits and the discriminative network classifies them.

Figure 3.14: Radio Transformer Network Architecture

The same architecture can be applied to radio communications problems (as we show in [117, 111]), where current day transformations such as application of equalizer taps or removal of carrier phase, frequency, or timing errors can be applied directly, as long as they can be implemented in a differentiable manner. In this case, we can split the network into a more general (not just spatial) parameter estimation network to estimate CSI, and a discriminative network to perform symbol estimation (or any other task), while maintaining our expert knowledge about the domain appropriate transforms in order to simplify the target learning manifold and often reduce the number of free parameters needed in our model. Since we have imparted expert knowledge only about the physical radio effects, we have specialized our solution only for the domain in general (e.g. things that happen to all radio signals). This is an important point, since we have not done anything to specialize the parameter estimation or discriminative networks for any one specific signal or modulation type, keeping domain-wide, non-signal-specific generality in our model architecture.

To validate the radio transformer network (RTN) approach, we consider several tasks. First, we consider the performance of a channel autoencoder under a Rayleigh fading channel with a tap length of L = 3. In this case, we allow the estimated parameters θ to take the form of $h^{-1}$, the inverse of the channel impulse response, which can be directly convolved with the received signal to obtain a canonical impulsive copy of the signal. We implement the convolution in differentiable tensor algebra within Keras [96] as a set of dense matrix multiplies and adds (the standard TensorFlow convolution operation cannot be used when both the input and the convolution taps are free variables).
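The idea of writing a convolution as dense multiplies and adds can be illustrated with plain Python lists standing in for tensors. This sketch only shows the algebra; it is not the Keras implementation used in this work, and the names `convolve_as_matmul` and `taps` (playing the role of the estimated inverse response) are hypothetical. In a tensor framework, the same products would presumably be differentiable with respect to both the signal and the taps.

```python
def convolve_as_matmul(y, taps):
    """Full linear convolution of y with taps as a matrix-vector product."""
    n, L = len(y), len(taps)
    # Toeplitz matrix built from the signal: row r, column c holds y[r - c].
    Y = [[y[r - c] if 0 <= r - c < n else 0.0 for c in range(L)]
         for r in range(n + L - 1)]
    # Only multiplies and adds are applied to the entries of Y and taps.
    return [sum(Y[r][c] * taps[c] for c in range(L)) for r in range(n + L - 1)]


def convolve_direct(y, taps):
    """Reference implementation using the usual double loop."""
    out = [0.0] * (len(y) + len(taps) - 1)
    for i, y_i in enumerate(y):
        for j, t_j in enumerate(taps):
            out[i + j] += y_i * t_j
    return out


received = [1.0, 2.0, 3.0]
taps = [1.0, 0.5, 0.25]  # L = 3 illustrative tap values
print(convolve_as_matmul(received, taps))
print(convolve_direct(received, taps))
```

Because the Toeplitz matrix is assembled from the signal rather than from fixed kernel weights, neither operand needs to be a constant, which is the property the text relies on.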

In figure 3.15 we illustrate the training complexity reduction for this task, comparing the training loss curve for an autoencoder with and without the CSI estimation network and transformer in front of the symbol discrimination task. Here, we can see that training converges to a solution in both cases, but with the RTN it converges much more quickly, reaching a good solution in only a few epochs, and ultimately achieves a much lower final CCE loss (and BLER).

Comparing the performance of the autoencoder with and without the RTN synchronizer on the front, we can observe the fully trained block error rate performance in figure 3.16.

Figure 3.15: Autoencoder training loss with and without RTN

[Plot: categorical cross-entropy loss versus training epoch; curves: Autoencoder, Autoencoder + RTN.]

Figure 3.16: BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps

[Plot: block error rate versus Eb/N0 [dB]; curves: Autoencoder (8,4), DBPSK(8,7) MLE + Hamming(7,4), Autoencoder (8,4) + RTN.]

Here, the non-RTN version is unable to outperform the baseline method of MLD DBPSK decoding with a Hamming code, while the autoencoder with RTN achieves significantly better performance, especially at higher SNR values. This is quite an exciting result, as it shows that a fully learned approach can leverage expert domain knowledge about radio propagation physics, still maintain full generality among signals, and very quickly learn a good solution which outperforms common baseline levels of performance through the RTN approach of CSI estimation, transformation, and symbol estimation. In this case, the learned model may also benefit from the bias present in the fading channel model (the distribution of the taps): since it is constrained to a set of L = 3 Rayleigh fading taps, the solution space is not uniform over all possible real values for $h^{-1}$, which generally allows the system to specialize better for the actual distribution.

Such a result could be incredibly powerful in a wireless environment, where CSI estimation and equalization could be heavily specialized and improved for the delay spread distribution within specific deployment scenarios and conditions. It is also somewhat troubling, however, in that it becomes increasingly important that the simulations and impairment models used for training sufficiently match the channel conditions which may be encountered in the real world at inference time.

This technique is a very general front-end strategy for leveraging knowledge about appropriate transforms when constructing ANN models for high dimensionality parametric search spaces. Results here are shown for the autoencoder and symbol decoding problem, but preliminary results show that such an approach can also help in sensing and other tasks, such as signal type or modulation recognition, or other sorts of signal property labeling through model learning on RF emissions in the spectrum.

3.3 Multi-User Interference Channel

One of the nice features of the channel autoencoder is the versatility with which it can solve many different formulations of the radio communications problem with variations on the same compact optimization problem, with no need to devise complex new physical layer encoding or signal processing strategies. One important such case is that of the multi-user interference channel, where the goal is optimization of some aggregate multi-user capacity rather than that of a single transmitter and receiver. This is a critical case in wireless systems, as it represents most wireless channels with which we interact on a daily basis, where we share some piece of spectrum (e.g. cellular bands, industrial, scientific, and medical radio (ISM) bands, ground mobile radio (GMR) bands) with a number of different users who must somehow share the available spectrum to optimize for some joint objective such as capacity. While multi-user capacity bounds have been derived for specific instances, no general solution exists to bound aggregate capacity under all conditions, meaning we do not know how far current day systems are from optimal usage of the interference channel. Unfortunately, today we have a slow iterative process of physical layer design, optimization, analysis, and then manual redesign based on whatever intuition is gleaned from the analysis. Channel autoencoders offer a tool with which to break out of this painful cycle and directly seek a globally optimal multi-user physical layer (PHY) scheme from the ground up, optimizing for aggregate capacity or any other pertinent design objective or constraint deemed important for its application.

Figure 3.17: The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages

Using the same channel autoencoder construct previously used, we can formulate the problem with a new mixing channel within the channel layer of two autoencoders, as shown in figure 3.17. Here there are two objectives to minimize, $L_1 = L_{CCE}(s_1, \hat{s}_1)$ and $L_2 = L_{CCE}(s_2, \hat{s}_2)$, the reconstruction losses for users 1 and 2 respectively. These can be treated as a single network, where each optimization step chooses a random batch of independent values for both $s_1$ and $s_2$ and completes a back-propagation step to minimize the two. Encoders in this case only have knowledge of their own transmit codeword, and the network architecture from table 3.3 is used, where dimensions [x, x] indicate two separate paths of size x and a dimension of [x] indicates a single path of size x.
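A minimal sketch of such a mixing channel is given below. It assumes real valued samples, unit gain on both the direct and interfering paths, and independent Gaussian noise at each receiver; these choices, and the function name, are assumptions made for illustration rather than details confirmed by the text.

```python
import random


def interference_channel(x1, x2, noise_std, rng=random):
    """Return (y1, y2): each receiver sees both transmit signals plus noise."""
    # Addition step: the two transmit signals are summed into a single path.
    mixed = [a + b for a, b in zip(x1, x2)]
    # Noise step: each receiver observes the sum with its own noise draw.
    y1 = [m + rng.gauss(0.0, noise_std) for m in mixed]
    y2 = [m + rng.gauss(0.0, noise_std) for m in mixed]
    return y1, y2


rng = random.Random(0)
y1, y2 = interference_channel([1.0, -1.0], [0.5, 0.5], noise_std=0.1, rng=rng)
print(y1, y2)
```

Each decoder then has to recover only its own message from an observation that also contains the other user's signal.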

Table 3.3: Layout of the multi-user autoencoder model

Layer             Output dimensions
Input             [M, M]
Dense + ReLU      [M, M]
Dense + linear    [n, n]
Normalization     [n, n]
Addition          [n]
Noise             [n, n]
Dense + ReLU      [M, M]
Dense + softmax   [M, M]

When optimizing for multiple loss functions there is often a question of how to combine them. This can be done additively, multiplicatively, or in many other ways, all of which have an effect on the optimization process and the form of the resulting error gradients. The most straightforward approach is to simply sum the two loss functions. Unfortunately, when doing this, it is not uncommon for imbalance to occur between the two objectives (e.g. favoring one user’s CCE loss, and therefore BLER, over that of another). If equal loss is desired among the loss functions, some means of balancing the loss magnitudes must be used. In this case, we seek to obtain fair performance among two users accessing the same channel. As described in [111], to address this, we adopt the following joint loss term $L_I$ with loss weight term $\alpha_t$, which is given an initial condition of α = 0.5 and is updated at each mini-batch time step t as follows.

$$L_I = \alpha L_1 + (1 - \alpha) L_2, \qquad \alpha_{t+1} = \frac{L_1}{L_1 + L_2}, \quad t > 0 \tag{3.1}$$

While this metric is heuristic in nature, it does a good job empirically of balancing the two loss functions during training to arrive at a PHY with roughly equal BLERs and mean symbol powers.

Figure 3.18: BLER versus Eb/N0 for the two-user interference channel achieved by the AE and $2^{2k/n}$-QAM TS for different parameters (n, k)

[Plot: block error rate versus Eb/N0 [dB]; curves: TS/AE (1, 1), TS/AE (2, 2), TS (4, 4), AE (4, 4), TS (4, 8), AE (4, 8).]
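The loss weighting rule of equation (3.1) can be sketched in a few lines. The per-batch loss values below are hypothetical and chosen only to show how the weight moves; in training they would come from the two decoders after each mini-batch.

```python
def joint_loss(loss1: float, loss2: float, alpha: float) -> float:
    """Weighted joint loss: alpha * L1 + (1 - alpha) * L2."""
    return alpha * loss1 + (1.0 - alpha) * loss2


def next_alpha(loss1: float, loss2: float) -> float:
    """Weight for the next mini-batch: L1 / (L1 + L2)."""
    return loss1 / (loss1 + loss2)


alpha = 0.5  # initial condition
for loss1, loss2 in [(0.9, 0.3), (0.6, 0.4), (0.5, 0.5)]:  # hypothetical losses
    total = joint_loss(loss1, loss2, alpha)
    alpha = next_alpha(loss1, loss2)
    print(round(total, 4), round(alpha, 4))
```

When user 1 has the larger loss, the weight on its term rises above 0.5 for the next step, which pushes the optimizer toward the lagging user; once the two losses are equal the weight returns to 0.5.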

When comparing the aggregate BLER (and thus multi-user capacity) of such a system with a completely orthogonal QAM based access sharing system such as time-sharing (orthogonal time access (TDM) from [111]), as shown in figure 3.18, we observe several important results. First, the time-sharing autoencoder system (TS/AE) outperforms the baseline time-sharing QAM system (TS), as we have previously shown; in this case the autoencoder simply learns a single user access strategy within each of its time-slots. Secondly, the multi-user autoencoder (AE or multi-user (MU)/AE) learns a solution which outperforms the TS/AE system even further. This result is illustrated for both 4-bit and 8-bit codeword sizes over a Gaussian interference channel in figure 3.18.

Figure 3.19: Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.


In the case of the MU/AE system, an aggregate BLER of roughly $10^{-3}$ is achieved for the 4-bit system at around 0.7 dB lower Eb/N0, while for the 8-bit system it is around 1 dB lower. This offers quite significant potential gains for future multi-user access systems, which generally only stand to improve as additional channel impairments are introduced and the number of users increases.

Inspecting the constellations learned by the MU/AE system helps to provide some intuition as to what has been learned. In figure 3.19, we illustrate the constellations learned in the (1,1), (2,2), (4,4), and (4,8) MU/AE configurations.

For the (1,1) system, the solution is a nice, easy to interpret one which a human designer might easily have come up with. Here the system has learned a set of two phase-orthogonal BPSK modulations at random rotation, providing in this case an orthogonal solution which does not reduce the rate of either user.

For the (2,2) system, the solution begins to become quite interesting. In this case, the system finds a sort of super-position code, in which each user uses a slightly skewed and phase-offset 4-QAM constellation within each time-slot, and the users alternate opportunities as the high powered user. This is not necessarily an intuitive solution, but inspecting the performance curve in figure 3.18, we see that it actually achieves better performance than the obvious solution of purely orthogonal time-slotted QPSK.

For the (4,4) and (4,8) systems, this trend of pseudo-orthogonal super-position code learning continues, but the solutions become increasingly complex and it is hard to gather significant intuition from them. Inspecting the (4,4) code, we can see that each user uses a unique layout of 16-QAM for each symbol to encode the 4 bits robustly across 4 symbols. The learned decoding process appears to be able to combine these decision surfaces very effectively into a robust low-error rate system for both cases. For the (4,8) system it is difficult to glean much from the constellation layouts, but we can see that the clusters of non-standard 256-QAM points form roughly oval shaped layouts whose major axes appear to be orthogonal.

The exciting nature of this approach to physical layer MU-scheme design is that such schemes can seemingly be readily learned for virtually any rate configuration, information density, impairment model, or other set of constraints introduced into the network training process. This opens up the door for highly efficient multi-user PHY schemes to be heavily specialized to their deployment domain, impairment distributions, multi-user configurations, and potentially higher level traffic patterns and requirements as well. Significant work remains to be done to consider optimal fusion of higher level network traffic requirements and source coding on top of the model presented here, as well as scaling the models to additional impairment constraints and higher numbers of users.

3.4 Learning Multi-Antenna Diversity Channels

Many modern radios, such as today’s LTE smart phones, do not use a single antenna element for transmit or receive. In fact, the LTE E-UTRA Physical Layer [118, 119] has required for several years that handsets (UEs) employ at least 2 receive antennas to allow for decoding of 2x2 MIMO [120] modes of transmission. Many phones today actually support 4 antenna receive, standards are now discussing 8x8 modes as a reality in future devices, and 5G test labs are evaluating techniques involving up to 128 base station antennas [121], or even 500-1000 antennas in some cases. The motivation for this is clear: MIMO systems have proven themselves to be invaluable both in extending range at the edge of coverage areas by coding redundancy across multiple propagation modes, and in increasing the achievable capacity in dense urban multi-path rich environments, where separate information can be coded across multiple propagation modes in order to increase aggregate throughput to a single user or to multiple users.

Today, state of the art methods for encoding information at the physical layer for the MIMO channel typically rely on either open-loop (no CSI feedback) space-time block code (STBC) [122] methods (the simplest being the Alamouti code [123]), or closed-loop (with CSI feedback used for pre-coding) style spatial multiplexing [124] methods.

Figure 3.20: Open Loop MIMO Channel Autoencoder Architecture

First we consider the case of an open-loop MIMO system where no CSI is known at the transmitter. We can structure this problem for an $m_t$ transmit antenna and $m_r$ receive antenna system as an autoencoder as shown in figure 3.20, where each codeword is k bits and spans n time-samples. Here, we encode some block of information s as before using a learned encoder, pass it through a channel model, and then recover an estimate $\hat{s}$ from the received signal y. The primary difference here is that x now takes the form of a 2D $m_t \times n$ tensor for each example, and y takes the form of an $m_r \times n$ tensor for each example. The process for complex MIMO Rayleigh channel matrix (H) generation and complex valued tensor multiplication must be implemented in differentiable tensor form within the channel impairment model, and then the same additive noise layer may be used to impose SNR constraints.

Figure 3.21: Alamouti Coding Scheme for 2x1 Open Loop MIMO

Figure 3.22: Error Rate Performance of Learned Diversity Scheme. [Plot: 2x1 Spatial Diversity Code Comparison; bit error rate (BER) versus signal to noise ratio (dB); curves: 2x1 AE No CSI, 2x1 Alamouti.]
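The MIMO channel step described above, y = Hx + w, can be sketched with Python complex numbers and nested lists in place of tensors. The sketch assumes i.i.d. unit-variance complex Gaussian entries for H (a common Rayleigh fading model) and circularly symmetric noise; it is not differentiable and is not the implementation used in this work, and the function names are hypothetical.

```python
import random


def rayleigh_channel(m_r, m_t, rng):
    """Draw an m_r x m_t matrix of i.i.d. unit-variance complex Gaussian gains."""
    s = 0.5 ** 0.5  # standard deviation per real and imaginary part
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(m_t)]
            for _ in range(m_r)]


def apply_channel(H, x, noise_std, rng):
    """Compute y = H x + w for x of shape m_t x n; y has shape m_r x n."""
    n = len(x[0])
    s = noise_std * 0.5 ** 0.5
    return [[sum(H[r][t] * x[t][j] for t in range(len(x)))
             + complex(rng.gauss(0, s), rng.gauss(0, s))
             for j in range(n)]
            for r in range(len(H))]


rng = random.Random(0)
H = rayleigh_channel(1, 2, rng)            # 2 transmit antennas, 1 receive antenna
x = [[1 + 0j, 1j], [1 + 0j, -1 + 0j]]      # m_t x n transmit block with n = 2
y = apply_channel(H, x, noise_std=0.1, rng=rng)
print(len(y), len(y[0]))
```

A fresh H would typically be drawn for each training example so that the learned scheme is exposed to the full fading distribution.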

We compare the bit error rate performance of the learned autoencoder-based 2x1 MIMO scheme based on the model in figure 3.20 to the conventional Alamouti code which is also an open-loop 2x1 code shown in figure 3.21.
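For reference, the standard Alamouti 2x1 scheme used as the baseline can be sketched as follows. This is the textbook form of the code with perfect channel knowledge at the receiver and no noise, shown only to make the baseline concrete; the function names are hypothetical.

```python
def alamouti_encode(s1, s2):
    """Symbols sent by antenna 1 and antenna 2 over two time slots."""
    return [s1, -s2.conjugate()], [s2, s1.conjugate()]


def alamouti_receive(ant1, ant2, h1, h2):
    """Noise-free signal at the single receive antenna in each slot."""
    return [h1 * a + h2 * b for a, b in zip(ant1, ant2)]


def alamouti_combine(r, h1, h2):
    """Linear combining; each output is (|h1|^2 + |h2|^2) times a symbol."""
    s1_hat = h1.conjugate() * r[0] + h2 * r[1].conjugate()
    s2_hat = h2.conjugate() * r[0] - h1 * r[1].conjugate()
    return s1_hat, s2_hat


h1, h2 = 0.8 + 0.3j, -0.2 + 0.9j                    # illustrative channel gains
s1, s2 = (1 + 1j) / 2 ** 0.5, (1 - 1j) / 2 ** 0.5   # two QPSK symbols
ant1, ant2 = alamouti_encode(s1, s2)
r = alamouti_receive(ant1, ant2, h1, h2)
gain = abs(h1) ** 2 + abs(h2) ** 2
print([s / gain for s in alamouti_combine(r, h1, h2)])
```

Both antennas transmit with equal power in every slot, which is the property the learned scheme is later contrasted against.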

Results for the open-loop case are mixed, and not initially as favorable as prior results for autoencoder or multi-user schemes. In both cases, we compare a (2x1,4) system, where two QPSK symbols (4 bits) are encoded into two time-slots and one receive antenna. Performance between the two schemes is similar; however, we observe two distinct regions: at low SNR the Alamouti scheme tends to outperform, providing lower bit error rates, while at high SNR the learned scheme provides a 2-3 dB advantage for obtaining equivalent error rates. An additional comparison incorporating error correction may make sense, such as a (4x1,6) scheme where a 3/4 rate code is used to map 6 bits onto two (2x1,4) Alamouti code words, while allowing the autoencoder to directly learn a solution to the (4x1,6) problem. However, these results are promising enough to warrant further investigation, and suggest that strong open-loop schemes may be learned in a similar way.

Figure 3.23: 2x1 MIMO AE, Diagonal H

Figure 3.24: 2x1 MIMO AE, Random H

Inspecting the resulting constellations learned in figures 3.23 and 3.24, we observe that a form of superposition code appears to be learned here as well to satisfy the average power constraint. This is an interesting solution, but it suggests that different and/or possibly better results could be obtained by introducing some kind of additional constraint to incentivize equal power between transmit antenna symbols (as is the case for Alamouti). This does raise the question to some extent as to whether the parameter search manifold for this problem has several very large local minima, where in this case we have been pulled into one solution which is sub-optimal despite the use of large networks, regularization, and infinite (generative) training data.

3.5 Learning MIMO with CSI Feedback

In dense urban environments with many radio reflectors, spatial multiplexing modes

[124] and closed-loop MIMO are commonly used to increase throughput and improve performance from multi-path propagation. These too can be represented through an ap- propriate autoencoder architecture. Figure 3.25 illustrates an autoencoder architecture for learning such a MIMO scheme which incorporates CSI (e.g. closed loop) into the transmit- ter encoding process. Here we have collapsed the traditional radio transmitter functions including FEC, modulation, and MIMO pre-coding all into a single encoder block which is learned end-to-end with the channel and decoding processes.

We can structure the architecture here such that our random channel state, H, is passed both to the channel impairment model (the complex multiply) and into the encoder module, simply by concatenating it with the symbol to transmit, s.
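The data flow just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `encoder_input` and `channel`, the one-hot representation of s, and the choice to feed H as stacked real and imaginary parts are hypothetical and not taken from the original implementation.

```python
import numpy as np

def encoder_input(s_index, H, num_messages):
    """Concatenate a one-hot message with the real and imaginary parts of H."""
    one_hot = np.zeros(num_messages)
    one_hot[s_index] = 1.0
    return np.concatenate([one_hot, H.real.ravel(), H.imag.ravel()])

def channel(x, H, noise_std, rng):
    """Channel impairment model: complex multiply by H plus complex Gaussian noise."""
    n_rx = H.shape[0]
    noise = (rng.standard_normal(n_rx) + 1j * rng.standard_normal(n_rx)) / np.sqrt(2)
    return H @ x + noise_std * noise

rng = np.random.default_rng(0)
# Random 2x2 complex channel state (assumed Rayleigh-like for illustration).
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
features = encoder_input(s_index=2, H=H, num_messages=4)  # length 4 + 2*4 = 12
```

In training, the same H sample would be used both to build `features` for the encoder and inside `channel`, so the encoder can adapt its output to the channel realization.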

Figure 3.25: Closed Loop MIMO Learning Autoencoder Architecture

Training such a system, we can compare to a variety of baseline methods such as zero forcing (ZF) or minimum mean square error (MMSE) methods for pre-coding. Here we consider the case where m_t = 2 and m_r = 2, which is the common 2x2 MIMO case used widely in LTE and other systems, but still a relatively small scale MIMO system.
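For reference, a zero forcing pre-coder for the 2x2 case can be sketched as below. This is a textbook-style illustration rather than the exact baseline used in this work; the function name `zf_precoder` and the Frobenius-norm power normalization are assumptions.

```python
import numpy as np

def zf_precoder(H):
    """Zero forcing pre-coder: pseudo-inverse of H, scaled to unit Frobenius norm."""
    W = np.linalg.pinv(H)
    return W / np.linalg.norm(W)  # default matrix norm is Frobenius

rng = np.random.default_rng(0)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
W = zf_precoder(H)
effective = H @ W  # scaled identity: the two streams no longer interfere
```

An MMSE pre-coder typically differs by adding a noise-dependent regularization term before inversion, which trades residual interference against noise enhancement.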

[Plot: 2x2 Scheme Performance with Perfect CSI. Bit Error Rate (BER) versus Signal to Noise Ratio (dB); curves: 2x2 AE P-CSI, 2x2 Baseline.]

Figure 3.26: Error Rate Performance of Learned 2x2 Scheme (Perfect CSI).

In this case, the learned scheme compares quite favorably to the baseline method. We see roughly a 5 dB improvement at a bit error rate (BER) of 10^-2 and a 10 dB improvement at a BER of 10^-3, both substantial. Of course the baseline could improve significantly with the introduction of error correction, but would have to give up some amount of information rate to do so, making the learned system extremely appealing.

Figure 3.27: Closed Loop MIMO Autoencoder with Quantized Feedback

However, in the real world, MIMO systems cannot and do not transmit real-valued channel estimates (Ĥ) over the air (e.g. between eNodeBs and UEs). Instead they typically must minimize protocol overhead used for channel quality information (CQI)/CSI feedback, which has led to the adoption of techniques like p-bit codebooks which contain compact discrete valued codes indicating distinct channel modes.

Considering this task of compact discrete valued CQI feedback representation as part of the end-to-end communications system learning architecture, we can cast the problem as shown in figure 3.27. Here, we introduce a discretization network (dis(H)), which encodes the real valued channel estimate Ĥ (H is used in our work without estimation error) into a v-bit discrete value with one-hot encoding over 2^v possible channel modes. This one-hot encoding is then concatenated with s to form the input to the MIMO encoder/modulator. This is quite exciting as we have now cast the entire end-to-end problem of compact CSI feedback, CSI-enhanced MIMO pre-coding, FEC encoding, modulation, over-the-air (OTA) representation, MIMO combining, demodulation, and decoding into one single learned model which jointly optimizes all of these free parameters to maximize capacity for any differentiable channel model.
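The forward pass of such a discretization step might look like the following sketch. The single linear scoring layer, the untrained random `weights`, and the name `discretize_csi` are hypothetical stand-ins for the learned dis(H) network; a hard argmax is also not differentiable, so end-to-end training would presumably need some relaxation or surrogate gradient that is not shown here.

```python
import numpy as np

def discretize_csi(H, weights):
    """Map a complex channel matrix to a one-hot vector over 2**v channel modes."""
    features = np.concatenate([H.real.ravel(), H.imag.ravel()])
    logits = weights @ features           # one score per channel mode
    one_hot = np.zeros(weights.shape[0])
    one_hot[np.argmax(logits)] = 1.0      # hard selection of a single mode
    return one_hot

v = 2
rng = np.random.default_rng(1)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
weights = rng.standard_normal((2 ** v, 8))  # hypothetical, untrained parameters
mode = discretize_csi(H, weights)
s = np.eye(4)[3]                            # one-hot symbol to transmit
encoder_in = np.concatenate([s, mode])      # input to the MIMO encoder
```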

Figure 3.28: Bit Error Rate Performance of Baseline ZF Method

[Plot: Baseline 2x2 Scheme Performance with Quantized CSI. Bit Error Rate (BER) versus Signal to Noise Ratio (dB); curves: 2x2 Baseline Perfect CSI, 2x2 Baseline 8-bit CSI, 2x2 Baseline 4-bit CSI, 2x2 Baseline 2-bit CSI.]

In figure 3.28 we illustrate the decline in performance when quantizing the real valued H feedback values with the ZF 2x2 scheme. Here, real-valued feedback provides the best performance, and while 8-bit CSI does not cause significant degradation, 4-bit and 2-bit CSI modes are substantially degraded.

In stark contrast, we can easily train the autoencoder based system to learn a v-bit CSI feedback mode which attempts to be optimal for any positive value of v. Figure 3.29 illustrates the performance curves of 1-bit, 2-bit, 4-bit, and 8-bit CSI feedback modes, alongside perfect CSI, the real-valued CSI feedback mode.

Figure 3.29: Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI

[Plot: 2x2 Scheme Performance With Quantized CSI. Bit Error Rate (BER) versus Signal to Noise Ratio (dB); curves: 2x2 AE 1 Bit, 2x2 AE 2 Bit, 2x2 AE 4 Bit, 2x2 AE 8 Bit, 2x2 AE P-CSI.]

Interestingly, we obtain the best performance from a 2-bit feedback mode rather than from larger numbers of bits or continuous valued feedback. This is likely because 2-bit feedback is enough to effectively generate a 4-entry codebook, whereas 1 bit is insufficient for the number of codebook modes required. Greater numbers of bits or continuous valued feedback require the encoder to learn a more complex manifold of different or continuously varying encoder modes; this task is significantly simpler and more rapidly trained for a small but sufficient number of bits (e.g. v = 2).

Figure 3.30: Learned 2x2 Scheme 1-bit CSI Random Channels.

Figure 3.31: Learned 2x2 Scheme 1-bit CSI All-Ones Channel.

Figure 3.32: Learned 2x2 Scheme 2-bit CSI Random Channels.

Figure 3.33: Learned 2x2 Scheme 2-bit CSI All-Ones Channel.

Inspecting the learned constellations for the 1-bit and 2-bit CSI feedback MIMO channel autoencoders in figures 3.30, 3.31, 3.32, and 3.33, under random channel conditions and under even-power per channel path (all 1's) assumptions for the H matrix, we can see that our best performing 2-bit scheme learns a set of non-standard 16-QAM transmit constellations which combine to form a relatively constant modulus, non-standard PSK-like ring arrangement at the receiver.

The system in figure 3.27 can be easily produced in simulation, where knowledge of H is free. However, in a real world system, such a trained system would need to be deployed as shown in figure 3.34, where an estimate Ĥ is produced at the receiver and used to form a discrete v-bit embedding to feed back to the encoder. This feedback could be included as digitally coded messages within a higher level media access control (MAC) protocol.

Figure 3.34: Deployment Configuration for Quantized MIMO Autoencoder

3.6 System Identification Over the Air

The key problem with this approach, and with its use in over the air systems, is that we have relied on having a closed-form differentiable model for the channel during training. This is an acceptable assumption if such a model can be built, but in the real world it may be difficult to do so when faced with complex impairment distributions over a range of different channel effects. Very recent published work realizing such a system over the air [125] addresses this problem by only fine-tuning the receiver/decoder half of the channel autoencoder using error feedback from OTA data. This is a partial solution, but it does not allow the encoder or over the air representation to update to optimize for the real over the air impairments.

In general this is still an open system identification [126] problem in which we desire to fit a function to the OTA data permutation which is occurring in the wireless channel. This is an important area of research when combined with the channel autoencoder to allow the systems to truly adapt under heavy real world impairments. By approximating the transfer function in a way that its gradient can be computed or approximated accurately, we can continue to train such systems end-to-end with a black box physical transform in the middle. Our future work and prototype systems will seek to solve this problem thoroughly in order to fully realize the power of channel autoencoders in the real world. Recent work is beginning to mature the approach of gradient approximation and back-propagation for black-box functions [127], which holds significant promise for this problem.

Chapter 4

Learning to Label the Radio Spectrum

Interpreting and labeling the radio spectrum is a critical building block on which countless radio capabilities are built today, and will increasingly be built tomorrow. In its simplest form, wireless channel estimation consumes some form of radio signal in time or frequency and produces an estimate for some parameter of an emitted and impaired signal. This is used in wireless synchronization to estimate time of arrival, digital symbol clock rates, carrier frequency and phase, as well as impulse response over the channel.

Larger scale radio labeling problems involve detection and identification of radio signal emissions, information about physical emitters, changes in channel propagation conditions, user access patterns, and countless other applications which may help inform spectrum regulators, dynamic spectrum access systems, wireless cyber-intrusion detection and anomaly detection systems, or other spectrum monitoring applications.


For many years radio data labeling problems have been treated as highly niche estimation tasks, where compact models of the emitter signal, compact (usually simplified) models for the wireless channel, and an analytical estimator derivation process are used to produce some analytic estimator expression. This has gotten us extremely far in the radio and radio labeling domain; however, it has several key drawbacks relating to insufficient model detail and unfavorably formed estimator algorithms. Radio signal models are often simplified when considered in the context of underlying data distributions, hardware impairments, and other distortions. Radio channel models are almost always simplified by assuming only AWGN, or by including only a compact simplified fading model, often omitting other real world impairments. Estimator derivation, on the other hand, often results in an analytically convenient small expression whose algorithmic implementation may be considered or approximated only later, when considering efficient implementation on available compute hardware and/or instruction sets.

By leveraging deep learning based on large datasets for estimator and label learning, we hope to demonstrate in this chapter how learned estimators, while merely serving as approximations, can often outperform traditional approaches. They do so by incorporating rich emitter and contextual information and rich and accurate channel models, and by forcing approximations to take the form of highly efficient wide matrix operations which synthesize efficiently onto modern wide/concurrent compute platforms. This ultimately improves accuracy and sensitivity, reduces power, weight, and size requirements for the resulting systems, and greatly reduces the amount of manual engineering time and cost required to obtain good practical solutions to new estimation problems.

4.1 Learning Estimators from Data

Synchronization is the principal difficult task of any radio receiver or modem. Aligning time, frequency, phase, and impulse response correctly for a received signal enables optimal decoding of transmitted symbols and reception of digital transmissions. Two of the most widely used estimators in any communications system are the timing estimator and the carrier frequency estimator.

Traditionally, maximum a posteriori (MAP), maximum likelihood estimation (MLE), and MMSE estimators are widely used for estimation of CSI values. We consider the canonical task of timing and frequency recovery for a single carrier QPSK signal [128]. Here, a common approach to carrier frequency offset (CFO) estimation is a fast Fourier transform (FFT) based technique which estimates the frequency using a periodogram of the mth power of the received signal [129]. The frequency offset detected by this technique is then given by (4.1).

\[
\Delta\hat{f} = \frac{F_s}{N \cdot m} \, \underset{f}{\operatorname{argmax}} \left| \sum_{k=0}^{N-1} r^m[k] \, e^{-j 2 \pi k t / N} \right|, \quad \left( -\frac{R_{sym}}{2} \le f \le \frac{R_{sym}}{2} \right) \tag{4.1}
\]

where m is the modulation order, r[k] is the received sequence, R_sym is the symbol rate, F_s is the sampling frequency, and N is the number of samples. The algorithm searches for a frequency that maximizes the time average of the mth power of the received signal over various frequencies in the range -R_sym/2 ≤ f ≤ R_sym/2. Because the algorithm operates in the frequency domain, the center frequency offset manifests as the maximum peak in the spectrum of r^m[k]. Fig. 4.1 shows an example cyclic spectrum for a QPSK signal with a 2500 Hz center frequency offset (and a baud rate of 100 ksym/sec), where the peak indicates the center frequency offset for the burst.

Figure 4.1: CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset
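A minimal NumPy sketch of this mth-power FFT estimator is given below, assuming m = 4 for QPSK. The function name `estimate_cfo` is hypothetical, and the test signal is deliberately simplified (one sample per symbol, no pulse shaping, no noise), so it illustrates the technique rather than reproducing the baseline used in the experiments.

```python
import numpy as np

def estimate_cfo(r, fs, m=4):
    """Estimate carrier frequency offset from the peak of the FFT of r**m."""
    n = len(r)
    spectrum = np.abs(np.fft.fft(r ** m))   # raising to the mth power strips the modulation
    freqs = np.fft.fftfreq(n, d=1.0 / fs)   # bin frequencies in Hz for r**m
    return freqs[np.argmax(spectrum)] / m   # the tone sits at m times the offset

rng = np.random.default_rng(0)
fs, f0, n = 400e3, 2500.0, 4096
# Unshaped QPSK symbols, one sample per symbol (simplifying assumption).
symbols = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, n)))
r = symbols * np.exp(2j * np.pi * f0 * np.arange(n) / fs)
cfo_hat = estimate_cfo(r, fs)
```

With these parameters the unambiguous estimation range is ±fs/(2m) = ±50 kHz, which matches R_sym/2 for a 100 kHz symbol rate, and the resolution is limited to fs/(N·m), roughly 24 Hz here.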

We conduct timing offset estimation in the canonical way by using a matched filter on the received sequence matched to a known preamble sequence. The time-offset which maximizes the output of the matched filter's convolution is then taken to be the time-offset of the received signal. Matched filtering can be represented by (4.2),

\[
y[n] = \sum_{k=-\infty}^{\infty} h[n-k] \, r[k], \tag{4.2}
\]

where h[k] is the preamble sequence. The matched filter is known as the optimal filter for maximizing detection sensitivity in terms of SNR in the presence of additive stochastic white noise.
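As a rough illustration, the matched filter timing estimate can be written as a sliding correlation against the known preamble. The name `estimate_timing_offset`, the random QPSK preamble, and the noise level are assumptions made for this sketch.

```python
import numpy as np

def estimate_timing_offset(r, preamble):
    """Return the sample offset that maximizes the matched filter output magnitude."""
    # np.correlate conjugates its second argument: c[n] = sum_k r[n+k] * conj(preamble[k])
    corr = np.correlate(r, preamble, mode="valid")
    return int(np.argmax(np.abs(corr)))

rng = np.random.default_rng(0)
preamble = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, 64)))
true_offset = 137
# Low-level complex noise with the preamble embedded at a known offset.
r = 0.1 * (rng.standard_normal(512) + 1j * rng.standard_normal(512))
r[true_offset:true_offset + 64] += preamble
offset_hat = estimate_timing_offset(r, preamble)
```

Taking the magnitude of the correlation makes the estimate insensitive to an unknown constant phase offset on the received burst.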

Our approximate, learned approach relies instead on constructing, training, and evaluating an ANN based on a representative dataset. When relying on learned estimators, much of the work and difficulty lies in generating a dataset which accurately reflects the final usage conditions desired for the estimator. In our case, we produce numerous examples of wireless emissions in complex baseband sampling with rich channel impairment effects which are designed to match the intended real world conditions the system will operate in. We associate target labels from ground truth for center frequency offset and timing error, which are used to optimize the estimator.

To train an ANN model, we consider the minimization of MSE, log hyperbolic cosine (log-cosh) [130], and Huber loss functions (shown in table 3.2). The latter two are known to have improved properties in robust learning, which may benefit such a regression learning task on some datasets and tasks. In our initial experiments, we observe the best quantitative performance using the MSE loss function, which we use for the remainder.
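For concreteness, the three regression losses can be sketched in NumPy as follows. These are the commonly used forms; the Huber threshold `delta` and its default of 1.0 are assumptions, and the exact expressions in table 3.2 may differ in scaling.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error."""
    e = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(e ** 2)

def log_cosh_loss(y_true, y_pred):
    """Mean of log(cosh(error)), computed in a numerically stable form."""
    e = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # log(cosh(e)) = log(exp(e) + exp(-e)) - log(2)
    return np.mean(np.logaddexp(e, -e) - np.log(2.0))

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    e = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    quadratic = 0.5 * e ** 2
    linear = delta * (e - 0.5 * delta)
    return np.mean(np.where(e <= delta, quadratic, linear))
```

Both log-cosh and Huber grow roughly linearly for large errors, which is why they are less sensitive to outlier labels than MSE.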

We search over a large range of model architectures using Adam [62] to perform gradient descent to optimize each model's parameters based on our training dataset. This is done by computing a loss function (e.g. L_MSE) and updating the weights of the neural network model iteratively using back-propagation of loss gradients. More information on the model search and selection process used is provided in chapter 5.3. This model search and optimization process ideally seeks a model of minimal computational complexity which achieves a satisfactory level of performance (the frontier of efficient models represents a trade-off between model complexity and accuracy).

Table 4.1: ANN Architecture Used for CFO Estimation

Layer               Output dimensions
Input               (nsamp,2)
Conv1D + ReLU       (variable,32)
AveragePooling1D    (variable,32)
Conv1D + ReLU       (variable,128)
Conv1D + ReLU       (variable,256)
Linear              1

Table 4.2: ANN Architecture Used for Timing Estimation

Layer               Output dimensions
Input               (2048,2)
Conv1D + ReLU       (511,32)
Conv1D + ReLU       (126,64)
Conv1D + ReLU       (30,128)
Conv1D + ReLU       (2,256)
Dense + Linear      (1)

The ANN architectures used for our performance evaluation are shown in tables 4.1 and 4.2. Both are stacked convolutional neural networks with narrowing dimensions which map noisy high dimensional raw time series data down to a compact single valued regression output. In the case of the CFO estimation architecture shown in table 4.1, we find that an average pooling layer works well to help improve performance and generalization of the initial layer feature maps, while in the timing estimation architecture in table 4.2, no pooling or max-pooling tends to work better. This makes sense on an intuitive level, as CFO estimation distills all symbols received throughout the input into a best frequency estimate, while timing, in a traditional matched filter sense, is typically derived from a maximum response at a single offset.

We generate two different sets of data for evaluating the performance of the two competing approaches. All generated data are based on QPSK bursts with equiprobable independent and identically distributed (IID) symbols, shaped with a root-raised cosine (RRC) filter with a roll-off β = 0.25 and a filter span of 6, and sampled at 400 kHz with a symbol rate of 100 kHz. We consider 4 channel conditions: AWGN with no fading, and three cases of Rayleigh fading with varying mean delay spreads in samples of σ = 0.5, 1, 2. Amplitude envelopes for a number of complex valued channel responses for each of these delay spreads are shown in figure 2.2 to provide some visual insight into the impact of Rayleigh fading effects at each of these delays. For the last case, inter-symbol interference (ISI) is present in the data.

The first dataset generated is the timing dataset, in which we prepend the burst with a known preamble of 64 symbols and random noise samples at the same SNR as the data portion of the burst. The duration of the prepended noise is drawn from a uniform distribution U(0, 1.25), in units of milliseconds. Additionally, a random phase offset drawn from U(0, 2π) is introduced for each burst in the dataset.

The second dataset generated is the center frequency offset dataset, in which every example burst has a center frequency offset drawn from a U(-50e3, 50e3) distribution, in units of Hz. The bounds of this distribution correspond to half the symbol rate, R_sym/2. Additionally, a random phase offset drawn from U(0, 2π) is introduced for each burst in the dataset.
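The offset and labeling step for one example could be sketched as follows. The helper name `apply_random_offsets` and the constant-envelope placeholder burst are hypothetical; a real example would start from a pulse-shaped, channel-impaired QPSK burst.

```python
import numpy as np

def apply_random_offsets(burst, fs=400e3, max_cfo=50e3, rng=None):
    """Rotate a complex baseband burst by a random CFO and phase; return labels too."""
    rng = np.random.default_rng() if rng is None else rng
    cfo = rng.uniform(-max_cfo, max_cfo)     # Hz, bounded by R_sym / 2
    phase = rng.uniform(0.0, 2.0 * np.pi)    # radians
    k = np.arange(len(burst))
    rotated = burst * np.exp(1j * (2.0 * np.pi * cfo * k / fs + phase))
    return rotated, cfo, phase

burst = np.ones(1024, dtype=complex)  # placeholder burst for illustration only
impaired, cfo_label, phase_label = apply_random_offsets(
    burst, rng=np.random.default_rng(0))
```

The returned `cfo_label` is the ground truth value that a CFO regression model would be trained to predict from `impaired`.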

These datasets are generated for SNRs of 0 dB, 5 dB, and 10 dB and for an AWGN channel and three different Rayleigh fading channels with different mean delay spread values (0.5, 1, and 2) representing different levels of reflection in a given wireless channel environment. We store the timing offset and center frequency offset labels as ground truth for training and evaluation.

For each dataset generated above we optimize network weights using Adam [62] for 100 epochs, reducing the initial learning rate of 1e-3 by a factor of two for each 10 epochs with no reduction in validation loss, and ultimately using the parameters corresponding to the epoch with the lowest validation loss. We then compute the test error on a separate data partition, comparing ground truth labels for timing and center frequency offset against predicted values generated using both expert and deep learning/ANN based estimators. The mean absolute error (MAE) of the estimator is used as our metric for comparison.
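The learning rate schedule, best-epoch selection, and MAE metric can be sketched in plain Python as below. The function names and the exact plateau bookkeeping (resetting the counter after each reduction) are assumptions about details the text does not specify.

```python
def halve_on_plateau(val_losses, initial_lr=1e-3, patience=10, factor=0.5):
    """Return the learning rate used in each epoch and the index of the best epoch."""
    lr, best, best_epoch, stale = initial_lr, float("inf"), -1, 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)                       # rate in effect during this epoch
        if loss < best:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:            # no improvement for `patience` epochs
                lr *= factor
                stale = 0
    return lrs, best_epoch

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy validation curve: improves, plateaus for ten epochs, then improves again.
lrs, best_epoch = halve_on_plateau([1.0, 0.5] + [0.6] * 10 + [0.4])
```

The parameters saved at `best_epoch` would be the ones used for the final test evaluation.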

In the timing estimation comparison, we show estimator MAE results in figure 4.2 for each model AWGN(τ, χ) and Fading(τ, χ), where τ is the mean delay spread and χ is the SNR. Inspecting these results, we can see that the traditional matched filter (MF)/MLE achieves excellent performance under the AWGN channel model. We can see significant degradation of the MF/MLE baseline accuracy under the fading channel models, however, as a simple matched filter MLE timing estimation approach has no ability to compensate for the expected range of channel delay spreads. In this case the machine learning / artificial neural network (ML/ANN) estimator approach cannot, on average, attain equivalent performance in all or even most cases. However, we see that this approach does attain an MAE within the same order of magnitude, and in some fading cases achieves a lower MAE.

Figure 4.2: Timing Estimation MAE Comparison

Quantitative results for estimation of center frequency offset error are shown in figures 4.3, 4.4, 4.5, and 4.6, summarizing the performance of both the baseline MLD method (dashed lines) and the ML/ANN method (solid lines). We compare the mean absolute center frequency estimate error for each method at a range of different estimator block input length sizes. As moment based methods generally improve for longer block sizes, we compare performance over a range of short-time examples to longer-time examples.

In the AWGN case, in figure 4.3 we can see that for the 5 and 10 dB SNR cases, by the time we reach a block size of 1024 samples, the baseline estimator is doing quite well, and for larger block sizes (above 1024 samples) with SNR of at least 5 dB, performance of the baseline method is generally better. However, even in the AWGN case, for small block sizes we are able to achieve lower error using the ML/ANN approach, even at low SNR levels near 0 dB.

Figure 4.3: Mean CFO Estimation Absolute Error for AWGN Channel

[Plot: CFO MAE under AWGN Channel. Estimator MAE (Hz) versus Block Size (samples); curves: ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB.]

Figure 4.4: Mean CFO Estimation Absolute Error (Fading σ=0.5)

[Plot: CFO MAE under Light Fading. Estimator MAE (Hz) versus Block Size (samples); curves: ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB.]

In the cases of fading channels shown in figures 4.4, 4.5, and 4.6, we can see that performance of the baseline estimator degrades enormously from the AWGN case under which it was derived when delay spread is introduced. Performance gets progressively worse as σ increases from 0.5 to 2 samples of mean delay spread. In the case of the ML/ANN estimator, we also see a degradation of estimator accuracy as delay spread increases, but the effect is not nearly as dramatic: error ranges from 3.4 to 23254 Hz in the MLD case (almost a 7000x increase in error) versus 2027 to 3305 Hz in the ML/ANN case (around a 1.6x increase in error).

From an accuracy standpoint, these results are quite interesting. We do not see significant improvement in timing estimation here against a matched filter; however, for frequency estimation, we see significant potential gains both for short-time estimators and for estimation under heavily impaired fading channel environments where AWGN assumptions used during derivation fail. This result helps illustrate how approximate data-centric learned models can often outperform toy analytic solutions in cases where the simplified model assumptions do not hold and where the degrees of freedom are too high to allow for accurate and efficient closed form solutions.

Figure 4.5: Mean CFO Estimation Absolute Error (Fading σ=1)

[Plot: CFO MAE under Medium Fading. Estimator MAE (Hz) versus Block Size (samples); curves: ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB.]

Figure 4.6: Mean CFO Estimation Absolute Error (Fading σ=2)

[Plot: CFO MAE under Heavy Fading. Estimator MAE (Hz) versus Block Size (samples); curves: ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB.]

4.2 Learning to Identify Modulation Types

One of the canonical tasks in radio estimation and detection is that of radio signal modulation identification. In radio sensing systems such as DSA systems [50], as well as in spectrum regulatory enforcement and other monitoring systems, signal modulation identification is often the first step towards identifying the emitter or protocol used by an emitter, and being able to communicate with or monitor it. This task can be treated simply as a classification problem among possible transmission modes (although this is a simplification of the possible hierarchical classification problem among emitter parameters). Significant literature exists on prior methods for performing radio signal type classification using analytically derived decision boundaries, as well as compact learned decision criteria using earlier machine learning methods such as decision trees (DTrees) or SVMs.

Our early work in this area, conducted in 2015 and first published publicly in 2016 [131], has received significant attention, spurring international interest; numerous derivative and related works at the IEEE DySPAN 2017 Mod-Rec workshop [132, 133, 134, 135, 136, 137, 138] and elsewhere; DARPA's RF Machine Learning Systems Program; DARPA's Battle-of-the-ModRecs Challenges; and parts of the DARPA Spectrum Collaboration Challenge (SC2), along with internal research programs at numerous companies.

Our basic approach, relying on end-to-end feature learning on raw In-phase and Quadrature (I/Q) data, remains the same, but a number of techniques and methods have been improved upon since the original paper [131], leading to significant improvements in detection sensitivity, power efficiency, and generality of such systems. The drawbacks of this approach, however, cannot be ignored. The need for labeled data, robust and realistic datasets, and comprehensive metrics for comparison cannot be overstated, and these often limit the performance attainable for a given problem. Initial attempts to address these needs by open sourcing classifiers, datasets/generators [139], and metrics/scores were welcomed by a few, but have not been heavily adopted or contributed to by many publishing in the field. The radio signal processing community still has a long way to go to embrace data science in the way that has become the norm in computer vision and many other disciplines. High quality public datasets from higher profile institutions such as DARPA or NSF would be a significant help in facilitating this some day.

4.2.1 Expert Features for Modulation Recognition (Baseline)

Modulation recognition has long been used as a toy problem in the radio estimation and detection world [140, 141, 142, 15, 143, 144, 138]. It sees some usage in spectrum monitoring applications, but is not widely deployed or necessary in many widely deployed communications systems.

Early work on this problem relies on analytically derived statistics and decision thresholds, typically derived probabilistically from a simplified analytic signal model (we refer to these as expert methods, i.e. written explicitly by an expert in the domain). Figure 4.7 (from [15]) illustrates one such traditional modulation recognition process for a digitally modulated radio signal. Here a series of statistics (v_n) are compared to a series of analytically derived decision thresholds (η_n), and a rigid analytically formed decision tree is used to make a modulation recognition decision.

Figure 4.7: Traditional Approach to Modulation Recognition, from [15]

For our baseline features in this work, we leverage a number of compact higher order statistics (HOSs). To obtain these we compute the higher order moments (HOMs) using the expression given below:

\[
M(p, q) = E\left[ x^{p-q} (x^*)^q \right] \tag{4.3}
\]

From these HOMs we can derive a number of higher order cumulants (HOCs) which have been shown to be effective discriminators for many modulation types [145]. HOCs can be computed combinatorially using HOMs, each expression varying slightly; below we show one example such expression for the C(4, 0) HOC.

\[
C(4, 0) = \sqrt{M(4, 0) - 3 \times M(2, 0)^2} \tag{4.4}
\]
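These statistics are straightforward to estimate from samples by replacing the expectation with a sample mean. The sketch below uses the conventional textbook fourth-order cumulant, M(4,0) − 3·M(2,0)², without any further scaling or root applied; the function names `moment` and `c40` are hypothetical.

```python
import numpy as np

def moment(x, p, q):
    """Sample estimate of M(p, q) = E[x^(p-q) (x*)^q]."""
    x = np.asarray(x, dtype=complex)
    return np.mean(x ** (p - q) * np.conj(x) ** q)

def c40(x):
    """Conventional fourth-order cumulant: M(4,0) - 3 * M(2,0)^2."""
    return moment(x, 4, 0) - 3 * moment(x, 2, 0) ** 2

# Ideal, noise-free unit-power constellations (each point equally likely).
qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))
bpsk = np.array([1.0, -1.0])
```

For these ideal constellations the conventional cumulant takes clearly separated values (−1 for QPSK and −2 for BPSK), which is what makes such features useful as discriminators.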

Additionally we consider a number of analog features which capture other statistical behaviors which can be useful. These include the mean, standard deviation, and kurtosis of the normalized centered amplitude, the centered phase, instantaneous frequency, absolute normalized instantaneous frequency, and several others which have been shown to be useful in prior work [146].

Machine learning is also considered for the decision making based on these sets of fea- tures. SVM and DTree are two commonly used methods which can be trained on the low-dimensional feature space in order to derive an optimized set of decision criteria.

Prior work has generally used machine learning and pattern recognition on simpler sets of features such as those described above. However, results have also been shown using increased complexity features such as the auto-correlation function (ACF), the SCF, or the α-profile (a one dimensional cut of the SCF) [147]. In our case, we compare instead against learning directly from the full dimensional input samples, without imparting expert design about what form features should take.

4.2.2 Time series Modulation Classification With CNNs

CNN layers have a very nice property in that layer parameters (weights) correspond to specific filters or kernels which are evaluated at regular shift intervals across the input values, limiting the parameter count while enforcing weight re-use at time shifts. This key feature is well suited to any input domain where translation invariance is appropriate. In imagery, learning arbitrary 2D shifts of where an object occurs in an image's X and Y axes can be greatly simplified by ensuring that the same feature weights are used to form activations at all shifts in the input using a convolutional layer. This property is also extremely similar to the properties of linear time invariant (LTI) systems, which are widely used to model radio communications systems as 1D time series constructs. Because radio signals may arrive with random time offsets and consist of primitive objects such as symbols which occur randomly in time to form a hierarchical structure, CNNs are well suited to learning low level time-domain features or basis functions for representing them. In fact, we already know and use this structure heavily in communications, as we have used matched filters for preamble detection, symbol detection and decisions, and many other purposes throughout the history of communications. The primary differences then are that we optimize filter weights during the training process rather than using pre-defined weights, we often use large hierarchies of multiple convolutional layers, and these layers often have many different filter channels operating in parallel to form higher feature-space representations.

Building upon key trends discussed in more depth in chapter 1.3, the raw CNN approach to modulation recognition leverages the relatively recent abilities of training algorithms, network architectures, and computational platforms to directly train using an end-to-end feature learning approach on high dimensional raw radio time series, as an alternative to trying to pre-engineer specific features such as statistical moments, cyclic moments, or other manually derived distillations of information. In both of [131, 111] we explore this approach in depth. Here we rely on a convolutional neural network on time series data to learn a deep net capable of performing robust classification of radio modulation types with random data.

Table 4.3: Layout for our 10 modulation CNN modulation classifier

Layer                                            Output dimensions
Input                                            2 × 128
Convolution (128 filters, size 2 × 8) + ReLU     128 × 121
Max Pooling (size 2, strides 2)                  128 × 60
Convolution (64 filters, size 1 × 16) + ReLU     64 × 45
Max Pooling (size 2, strides 2)                  64 × 22
Flatten                                          1408
Dense + ReLU                                     128
Dense + ReLU                                     64
Dense + ReLU                                     32
Dense + softmax                                  10
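The output dimensions in Table 4.3 follow from simple length arithmetic, assuming unpadded ("valid") convolutions with stride 1 and non-overlapping pooling. The short check below is only a sketch of that arithmetic; the helper names are hypothetical and no network is actually built.

```python
def conv_out(length, kernel):
    """Output length of an unpadded, stride-1 convolution."""
    return length - kernel + 1

def pool_out(length, size=2, stride=2):
    """Output length of a pooling layer without padding."""
    return (length - size) // stride + 1

def table_4_3_lengths(n_samples=128):
    """Time-axis length after each convolution and pooling stage, plus the flatten size."""
    conv1 = conv_out(n_samples, 8)   # 2x8 kernel also collapses the I/Q axis to 1
    pool1 = pool_out(conv1)
    conv2 = conv_out(pool1, 16)      # 1x16 kernel
    pool2 = pool_out(conv2)
    flat = 64 * pool2                # 64 filters in the second convolution
    return conv1, pool1, conv2, pool2, flat

lengths = table_4_3_lengths()
```

Under these assumptions `lengths` evaluates to (121, 60, 45, 22, 1408), matching the table.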

The only pre-processing used is to ensure zero mean and unit variance of the raw signal input vector, so that examples are nicely scaled to facilitate learning. In some cases, we only enforce unit variance since certain classes are only differentiated by their mean shift (e.g. analog modulations with and without a carrier at DC).
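A minimal sketch of this pre-processing step is given below. It treats each example as a complex vector and normalizes it as a whole. Whether the original pipeline normalizes the complex vector or the I and Q rows separately is not stated here, so that choice, along with the function name `normalize`, is an assumption.

```python
import numpy as np

def normalize(iq, remove_mean=True):
    """Scale one example to unit variance, optionally removing its mean.

    With remove_mean=False the DC component is kept (only rescaled), which
    preserves the cue that separates carrier and suppressed-carrier classes.
    """
    iq = np.asarray(iq, dtype=np.complex128)
    scale = iq.std()              # spread about the mean, real and non-negative
    if remove_mean:
        iq = iq - iq.mean()
    return iq / scale

rng = np.random.default_rng(0)
# Hypothetical example: complex noise riding on a DC offset.
x = 3.0 * (rng.normal(size=128) + 1j * rng.normal(size=128)) + (2.0 + 1.0j)

z_full = normalize(x)                      # zero mean, unit variance
z_var = normalize(x, remove_mean=False)    # unit variance, mean retained
```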

As is widely done for image classification, we adopt a narrowing series of convolutional layers followed by dense/fully-connected layers and terminated with a dense softmax layer for our classifier (similar to a VGG architecture [148]). The dataset¹ for this benchmark consists of 1.2 M sequences of 128 complex-valued baseband I/Q samples corresponding to ten different digital and analog single-carrier modulation schemes (amplitude modulation (AM), frequency modulation (FM), PSK, QAM, etc.) that have gone through a wireless channel with harsh impairments including multi-path fading and both clock and carrier rate offset [131]. The samples are taken at 20 different SNRs within the range from −20 dB to 18 dB.

¹ RML2016.10b: https://radioml.com/datasets/radioml-2016-10-dataset/

Figure 4.8: 10 Modulation CNN performance comparison of accuracy vs SNR

[Plot: correct classification probability (0 to 1) vs SNR (−20 dB to 10 dB); curves: CNN, Boosted Tree, Single Tree, Random Guessing]

In Fig. 4.8, we compare the classification accuracy of the CNN against that of extreme gradient boosting with 1000 estimators, as well as a single scikit-learn decision tree [149], operating on a mix of 16 analog and cumulant expert features as proposed in [146] and [145]. The short-time nature of the examples places this task on the difficult end of the modulation classification spectrum since we cannot compute expert features with high stability over long periods of time. The CNN outperforms the boosted feature-based classifier by around 4 dB in the low to medium SNR range while the performance at high SNR is similar. Performance in the single tree case is about 6 dB worse than the CNN at medium SNR and 3.5% worse at high SNR.

Figure 4.9: Confusion matrix of the CNN (SNR = 10 dB)

[Confusion matrix: ground truth (rows) vs prediction (columns) over 8PSK, AM-DSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, WBFM; color scale 0.0 to 1.0]

Fig. 4.9 shows the confusion matrix for the CNN at SNR = 10 dB, revealing confusing cases between QAM16 and QAM64 and between Wideband FM (WBFM) and double-sideband AM (AM-DSB). Despite the high SNR, classification is imperfect due to several other impairments as described above. The distinction between AM-DSB and WBFM is additionally complicated by the small observation window (0.64 ms of modulated speech per example) and low information rate with frequent silence between words. Discriminating between QAM16 and QAM64 also suffers from short-time observations over only a few symbols since constellations are higher order and share common points. The accuracy of the feature-based classifier saturates at high SNR for the same reasons, and neither classifier reaches a perfect score on this dataset. In [150], the authors report on a successful application of a similar CNN for the detection of black hole mergers in astrophysics from noisy time-series data.

4.2.3 Deep Residual Network Time-series Modulation Classification

Since the publication of our original work [131, 111] on CNN based signal identification described in the previous section, numerous advances have been made in neural network architecture with significant implications for structuring CNN solutions for the modulation recognition problem. Key among these are residual networks [4], batch normalization [41], self-normalizing networks [73], and the use of deep dilated convolutional architectures [2], among others. In this section, we detail updated results leveraging these techniques, considering performance over the air, and improving our synthetic dataset performance, while providing performance trade-off comparisons detailing the impact of a number of factors.

Dataset Structure and Improvements

It became clear from the dataset in [139] and prior datasets that streaming models with coherent channel impairments were not appropriate for training. Randomly sampling many samples with independent channel state, rather than adjacent samples with correlated channel state, provided significant gain and realism for the problem. To better characterize the distribution of the data, we introduce the random variables in table 4.4, each IID for every independent training example. The training data synthesis model is illustrated in figure 4.10.

Table 4.4: Random Variable Initialization

Random Variable     Distribution
α                   U(0.1, 0.4)
∆t                  U(0, 16)
∆fs                 N(0, σclk)
θc                  U(0, 2π)
∆fc                 N(0, σclk)
H                   Σi δ(t − Rayleighi(τ))
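As a rough illustration of how the random variables in table 4.4 might be drawn independently for each training example, consider the following sketch. The function name, the dictionary keys, the number of channel taps (`n_taps`), and the reading of σclk as a standard deviation are all assumptions introduced for illustration. The table specifies only the distributions.

```python
import numpy as np

def draw_impairments(rng, sigma_clk=0.01, tau=1.0, n_taps=3):
    """Draw one independent set of impairment parameters (cf. table 4.4).

    n_taps is a hypothetical choice; the table gives only the form of H.
    sigma_clk is treated as a standard deviation, which is an assumption.
    """
    return {
        "alpha": rng.uniform(0.1, 0.4),
        "delta_t": rng.uniform(0.0, 16.0),
        "delta_fs": rng.normal(0.0, sigma_clk),
        "theta_c": rng.uniform(0.0, 2.0 * np.pi),
        "delta_fc": rng.normal(0.0, sigma_clk),
        # Delays of the impulses making up the channel response H.
        "tap_delays": rng.rayleigh(scale=tau, size=n_taps),
    }

rng = np.random.default_rng(0)
# A fresh draw per example gives every example its own channel state.
draws = [draw_impairments(rng) for _ in range(100)]
```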

Figure 4.10: System for modulation recognition dataset signal generation and synthetic channel impairment modeling

We consider two different compositions of the dataset, first a “Normal” dataset, which consists of 11 classes which are all relatively low information density and are commonly seen in impaired environments. These 11 signals represent a relatively simple classification task at high SNR in most cases, somewhat comparable to the canonical MNIST digits. Second, we introduce a “Difficult” dataset, which contains all 24 modulations. These include a number of high order modulations (QAM256 and APSK256), which are used in the real world in very high-SNR low-fading channel environments such as on line of sight (LOS) impulsive satellite links [151] (e.g. DVB-S2X). We, however, apply impairments which are beyond those which you would expect to see in such a scenario and consider only relatively short-time observation windows for classification, where the number of samples is ℓ = 1024. Short time classification is a hard problem since decision processes cannot wait and acquire more data to increase certainty. This is the case in many real world systems when dealing with short observations (such as when rapidly scanning a receiver) or short signal bursts in the environment. Under these effects, with low SNR examples (from -20 dB to +30 dB Es/N0), one would not expect to be able to achieve anywhere near 100% classification rates on the full dataset, making it a good benchmark for comparison and future research.

The specific modulations considered within each of these two dataset types are as follows:

• Normal Classes: OOK, 4ASK, BPSK, QPSK, 8PSK, 16QAM, AM-SSB-SC, AM-DSB-SC, FM, GMSK, OQPSK

• Difficult Classes: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK

The raw datasets will be made available on the RadioML website² after publication.

² https://radioml.org

Over the air dataset generation

In addition to simulating wireless channel impairments, we also implement an OTA test-bed in which we modulate and transmit signals using a USRP [152] B210 SDR. We use a second B210 (with a separate free-running local oscillator (LO)) to receive these transmissions in the lab, over a relatively benign indoor wireless channel on the 900 MHz ISM band. These radios use the Analog Devices AD9361 [153] radio frequency integrated circuit (RFIC) as their radio front-end and have an LO that provides a frequency (and clock) stability of around 2 parts per million (PPM). We off-tune our signal by around 1 MHz to avoid DC signal impairment associated with direct conversion, but store signals at base-band (offset only by LO error). Received test emissions are stored unmodified along with ground truth labels for the modulation from the emitter. Figure 4.11 illustrates the hardware recording architecture used for our data capture, and the picture in figure 4.12 illustrates the actual hardware used for data capture, training and evaluation.

Baseline classification approach

Our baseline method leverages the list of HOMs and other aggregate signal behavior statistics given in table 4.5. Here we can compute each of these statistics over each 1024 sample example, and translate the example into feature space, a set of real values associated with each statistic for the example. This new representation has reduced the dimension of each example from R^(1024×2) to R^28, making the classification task much simpler but also discarding the vast majority of the data. We use an ensemble model of gradient boosted trees (XGBoost) [154] to classify modulations from these features, which outperforms a single decision tree or SVM significantly on the task. (We additionally evaluated methods including SVM [32], Naive Bayes, k-Nearest Neighbor, and deep neural network (DNN) on feature data in [131, 111], but ultimately XGBoost offered the strongest performing feature-based classification approach, which is why we focus on it here.)

Figure 4.11: Over the air capture system diagram

Figure 4.12: Picture of over the air lab capture and training system

Table 4.5: Features Used

Feature Name
M(2,0), M(2,1)
M(4,0), M(4,1), M(4,2), M(4,3)
M(6,0), M(6,1), M(6,2), M(6,3)
C(2,0), C(2,1)
C(4,0), C(4,1), C(4,2)
C(6,0), C(6,1), C(6,2), C(6,3)
Additional analog (section 4.2.1)
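To make the feature-space translation concrete, the sketch below estimates a few of the moments and cumulants from table 4.5 for a single example. It assumes the common convention M(p,q) = E[x^(p−q) (x*)^q] and the standard zero-mean relations between fourth-order cumulants and moments. The text above does not spell out its exact definitions, so these, and the function names, should be read as assumptions rather than the precise feature extractor used.

```python
import numpy as np

def moment(x, p, q):
    """Sample estimate of M(p,q) = E[x^(p-q) * conj(x)^q] (assumed convention)."""
    return np.mean(x ** (p - q) * np.conj(x) ** q)

def cumulant_features(x):
    """A subset of the table 4.5 statistics for one complex example."""
    x = np.asarray(x, dtype=np.complex128)
    x = x - x.mean()  # the cumulant relations below assume zero mean
    m20, m21 = moment(x, 2, 0), moment(x, 2, 1)
    m40, m41, m42 = moment(x, 4, 0), moment(x, 4, 1), moment(x, 4, 2)
    return {
        "C(2,0)": m20,
        "C(2,1)": m21,
        "C(4,0)": m40 - 3 * m20 ** 2,
        "C(4,1)": m41 - 3 * m20 * m21,
        "C(4,2)": m42 - abs(m20) ** 2 - 2 * m21 ** 2,
    }

# Noise-free, unit-power constellations repeated to fill 1024 samples.
qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))
bpsk = np.array([1.0, -1.0])
f_qpsk = cumulant_features(np.tile(qpsk, 256))
f_bpsk = cumulant_features(np.tile(bpsk, 512))
```

Even this small subset separates the two constellations (C(4,2) is −1 for QPSK and −2 for BPSK in the noise-free case), which is the kind of compact cue the tree-based baseline relies on.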

Deep Learning based classification approaches

We evaluate and tune two classes of networks: first, a VGG-style CNN using max-pooling, shown in table 4.6, and second, a residual network leveraging dilated convolutions appropriate for time series radio signals and self-normalizing fully connected layers to map residual/CNN features to outputs, shown in table 4.7.

In [148], the question of how to structure such networks is explored, and several basic design principles for "VGG" networks are introduced (e.g. filter size is minimized at 3x3, smallest size pooling operations are used at 2x2). Following this approach has generally led to a straightforward way to construct CNNs with good performance. We adapt the VGG architecture principles to a 1D CNN, improving upon the similar networks in [131, 111]. This represents a simple DL CNN design approach which can be readily trained and deployed to effectively accomplish many small radio signal classification tasks.

Figure 4.13: Example graphic of high level feature learning based residual network architecture for modulation recognition

As network algorithms and architectures have improved since AlexNet, they have made the effective training of deeper networks using more and wider layers possible, leading to improved performance. In the computer vision space, the idea of deep residual networks has become increasingly effective [4]. In a deep residual network, as is shown in figure 4.20, the notion of skip or bypass connections is used heavily, allowing for features to operate at multiple scales and depths through the network. This has led to significant improvements in computer vision performance, and has also been used effectively on time-series audio data [2]. In [155], the use of residual networks for time-series radio classification is investigated, and seen to train in fewer epochs, but not to provide significant performance improvements in terms of classification accuracy. We revisit the problem of modulation recognition with a modified residual network and obtain improved performance when compared to the CNN on this dataset. A high level depiction of this architecture is shown in figure 4.13. The basic residual unit and stack of residual units is shown in figure 4.20, while the complete network architecture for our best architecture for (ℓ = 1024) is shown in table 4.7. We also employ self-normalizing neural networks [73] in the fully connected region of the network, employing the SELU activation function [73], mean-response scaled initializations (MRSA) [156], and Alpha Dropout [73], which provides a slight improvement over conventional ReLU performance.
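The two ingredients named above, skip connections and the SELU activation, can be sketched in a few lines of NumPy. The single-channel `residual_unit` below is a generic illustration of a bypass connection and is not the specific unit of figure 4.20. Its layer ordering and kernel sizes are assumptions. The SELU constants are the standard published values.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit used by self-normalizing networks."""
    x = np.asarray(x, dtype=float)
    negative_branch = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative_branch)

def residual_unit(x, w1, w2):
    """Generic single-channel residual unit: output = x + f(x).

    f is two length-preserving convolutions with a ReLU in between; the
    input bypasses f through the skip connection.
    """
    h = np.maximum(np.convolve(x, w1, mode="same"), 0.0)
    h = np.convolve(h, w2, mode="same")
    return x + h

x = np.linspace(-1.0, 1.0, 16)
w1 = np.array([0.25, 0.5, 0.25])   # hypothetical kernel values

# With a zero second kernel the learned path vanishes and the unit is the
# identity, which is part of what makes deep stacks of these units easy to train.
y_identity = residual_unit(x, w1, np.zeros(3))
y_active = residual_unit(x, w1, np.array([0.0, 1.0, 0.0]))
```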

Significant tuning time was spent optimizing both networks, and a collection of different trade studies are shown below. A thorough analysis of all of the hundreds (or limitless) network architecture design choices possible is difficult to address in this same depth. However, the architecture tuning process is revisited again in more depth later in chapter 5.3, where we consider dealing with the model hyper-parameter design choices using a secondary optimization process.

Table 4.6: CNN Network Layout

Layer          Output dimensions
Input          2 × 1024
Conv           64 × 1024
Max Pool       64 × 512
Conv           64 × 512
Max Pool       64 × 256
Conv           64 × 256
Max Pool       64 × 128
Conv           64 × 128
Max Pool       64 × 64
Conv           64 × 64
Max Pool       64 × 32
Conv           64 × 32
Max Pool       64 × 16
Conv           64 × 16
Max Pool       64 × 8
FC/SeLU        128
FC/SeLU        128
FC/Softmax     24

Table 4.7: ResNet Network Layout

Layer             Output dimensions
Input             2 × 1024
Residual Stack    32 × 512
Residual Stack    32 × 256
Residual Stack    32 × 128
Residual Stack    32 × 64
Residual Stack    32 × 32
Residual Stack    32 × 16
FC/SeLU           128
FC/SeLU           128
FC/Softmax        24

Figure 4.14: Complex time domain examples of 24 modulations from the dataset at simulated 10dB Eb/N0 and ℓ = 256

Figure 4.15: Complex time domain examples of 24 modulations over the air at high SNR and ℓ = 256

Figure 4.16: Complex constellation examples of 24 modulations from the dataset at simulated 10dB Eb/N0 and ℓ = 256

Figure 4.17: Complex time domain examples of 24 modulations from the dataset at simulated 0dB Eb/N0 and ℓ = 256

We show a number of examples from both the synthetic and over the air datasets to give some intuition about what each example looks like at differing SNR levels, and how similar classes appear at lower SNR. Each example is 1024 complex valued samples at 1 MSamp/sec with a baud rate of 200 ksym/sec. We show time domain examples for all 24 classes, where figures 4.14 and 4.17 illustrate time domain signals at 10 dB and 0 dB respectively. Figure 4.15 illustrates an OTA capture of the dataset with relatively high SNR, and figure 4.16 illustrates the 10 dB SNR synthetic dataset in the complex plane, to provide an alternate perspective on the complex valued trajectories through modulation symbol points.
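The sample rate, symbol rate, and example length quoted above fix how much signal each example actually contains. The arithmetic below uses only those three numbers; the variable names are ours.

```python
sample_rate = 1_000_000    # samples per second (1 MSamp/sec)
symbol_rate = 200_000      # symbols per second (200 ksym/sec)
example_length = 1024      # samples per example

samples_per_symbol = sample_rate / symbol_rate           # 5.0
duration_ms = 1e3 * example_length / sample_rate         # 1.024 ms
symbols_per_example = example_length / samples_per_symbol  # 204.8

print(samples_per_symbol, duration_ms, symbols_per_example)
```

Each decision is therefore made from roughly two hundred symbols, which is few for distinguishing between high order constellations.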

Classification on low-order modulations

We first compare performance on the lower difficulty dataset on lower order modulation types. Training on a dataset of 1 million examples, each 1024 samples long, we obtain excellent performance at high SNR for both the VGG CNN and the ResNet (RN) CNN. In this case, the ResNet achieves roughly 5 dB higher sensitivity for equivalent classification accuracy than the baseline, and at high SNR a maximum classification accuracy rate of 99.8% is achieved by the ResNet, while the VGG network achieves 98.3% and the baseline method achieves a 94.6% accuracy. At lower SNRs, performance between VGG and ResNet networks is virtually identical, but at high SNR performance improves considerably using the ResNet, obtaining almost perfect classification accuracy.

Figure 4.18: 11-Modulation normal dataset performance comparison (N=1M)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: Baseline, VGG/CNN, ResNet]

For the remainder of this chapter, we will consider the much harder task of 24 class high order modulations containing higher information rates and much more easily confused classes between multiple high order PSKs, APSKs and QAMs.

Classification under AWGN

Signal classification under AWGN is the canonical problem which has been explored for many years in communications literature. It is a simple starting point, and it is the condition under which analytic feature extractors should generally perform their best (since they were derived under these conditions). In figure 4.19 we compare the performance of the ResNet (RN), VGG network, and the baseline (BL) method on our full dataset for ℓ = 1024 samples, N = 239,616 examples, and L = 6 residual stacks. Here, the residual network provides the best performance at both high and low SNRs on the difficult dataset by a margin of 2-6 dB in improved sensitivity for equivalent classification accuracy. Here, N indicates the number of examples in the dataset, ℓ indicates the number of samples of input per example, and L indicates the number of residual stacks included in the network (where a single residual stack architecture is shown in figure 4.20).

Figure 4.19: 24-Modulation difficult dataset performance comparison (N=240k)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: BL AWGN, RN AWGN, VGG AWGN]

Classification under Impairments

In any real world scenario, wireless signals are impaired by a number of effects. While AWGN is widely used in simulation and modeling, the effects of fading, carrier offset, and clock offset are present almost universally in wireless systems. It is interesting to inspect how well this class of learned classifiers performs under such impairments and compare their rate of degradation under impairments with that of more traditional approaches to signal classification.

Figure 4.20: Residual unit and residual stack architectures

In figure 4.21 we plot the performance of the residual network based classifier under each considered impairment model. This includes AWGN, minor LO offset (σclk = 0.0001), moderate LO offset (σclk = 0.01), and several fading models ranging from minor (τ = 0.5) to harsh (τ = 4.0). Under all fading models, minor LO offset is assumed as well. Interestingly in this plot, ResNet performance improves under LO offset rather than degrading. Additional LO offset, which results in spinning or dilated versions of the original signal, appears to have a positive regularizing effect on the learning process which provides quite a noticeable improvement in performance. At high SNR, performance ranges from around 80% in the best case down to about 59% in the worst case.
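The "spinning" effect of a carrier LO offset can be sketched as a per-sample phase rotation of the complex baseband signal. The sketch below covers only this rotation and not the dilation caused by a sample clock offset. The function name and the expression of the offset in cycles per sample are assumptions, since the mapping from σclk to a physical frequency is not given in this section.

```python
import numpy as np

def apply_lo_offset(x, freq_offset, phase=0.0):
    """Rotate a complex baseband signal by a constant frequency offset.

    freq_offset is in cycles per sample (an assumed unit); phase is the
    initial carrier phase in radians.
    """
    n = np.arange(len(x))
    return x * np.exp(1j * (2.0 * np.pi * freq_offset * n + phase))

# A constant signal offset by a quarter cycle per sample visits 1, j, -1, -j.
spun = apply_lo_offset(np.ones(4, dtype=complex), 0.25)

# The rotation changes phase trajectories but not the envelope.
rng = np.random.default_rng(0)
x = rng.normal(size=1024) + 1j * rng.normal(size=1024)
y = apply_lo_offset(x, rng.normal(0.0, 0.01), rng.uniform(0.0, 2.0 * np.pi))
```

Drawing a new offset and phase for every example exposes the network to many rotated views of the same constellation, which is one plausible reading of the regularizing effect noted above.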

In figure 4.22 we show the degradation of the baseline classifier under impairments. In this case, LO offset never helps; instead, performance degrades with both LO offset and fading effects. In the best case at high SNR this method obtains about 61% accuracy, while in the worst case it degrades to around 45% accuracy.

Figure 4.21: Resnet performance under various channel impairments (N=240k)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: RN AWGN, RN σclk = 0.01, RN σclk = 0.0001, RN τ = 0.5, RN τ = 1, RN τ = 2, RN τ = 4]

Figure 4.22: Baseline performance under channel impairments (N=240k)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: BL AWGN, BL σclk = 0.01, BL σclk = 0.0001, BL τ = 0.5, BL τ = 1, BL τ = 2, BL τ = 4]

Figure 4.23: Comparison models under LO impairment

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: BL σclk = 0.01, RN σclk = 0.01, VGG σclk = 0.01]

Directly comparing the performance of each model under moderate LO impairment effects, in figure 4.23 we show that for many real world systems with unsynchronized LOs and Doppler frequency offset there is nearly a 6 dB performance advantage of the ResNet approach vs the baseline, and a 20% accuracy increase at high SNR. In this section, all models are trained using N = 239,616 and ℓ = 1024 for this comparison.

Classifier performance by network depth

Model size can have a significant impact on the ability of large neural network models to accurately represent complex features. In computer vision, convolutional layer based DL models for the ImageNet dataset started around 10 layers deep, but modern state of the art networks on ImageNet are often over 100 layers deep [157], and more recently even over 200 layers. Initial investigations of deeper networks in [155] did not show significant gains from such large architectures, but with use of deep residual networks on this larger dataset, we begin to see quite a benefit to additional depth. This is likely due to the significantly larger number of examples and classes used. In figure 4.24 we show the increasing validation accuracy of deep residual networks as we introduce more residual stack units within the network architecture (i.e. making the network deeper). We see that performance steadily increases with depth in this case, with diminishing returns as we approach around L = 6. When considering all of the primitive layers within this network, when L = 6 the ResNet has 121 layers and 229k trainable parameters, and when L = 0 it has 25 layers and 2.1M trainable parameters. Results are shown for N = 239,616 and ℓ = 1024.

Figure 4.24: ResNet performance vs depth (L = number of residual stacks)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: L=1, L=2, L=3, L=4, L=5, L=6]

Figure 4.25: Modrec performance vs modulation type (Resnet on synthetic data with N=1M, σclk=0.0001)

[Plot: correct classification probability (0 to 1) vs signal to noise ratio (Es/N0) [dB] (−20 to 15); one curve per modulation: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK]

Classification performance by modulation type

Figure 4.26: 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M, AWGN, and SNR ≥ 0dB

Figure 4.27: Performance vs training set size (N) with ℓ = 1024

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 18); curves: N=1k, N=2k, N=4k, N=8k, N=15k, N=31k, N=62k, N=125k, N=250k, N=500k, N=1M, N=2M]

In figure 4.25 we show the performance of the classifier for individual modulation types. Detection performance of each modulation type varies drastically over about 18 dB of SNR. Some signals with lower information rates and vastly different structure, such as AM and FM analog modulations, are much more readily identified at low SNR, while high-order modulations require higher SNRs for robust performance and never reach perfect classification rates. However, all modulation types reach rates above 80% accuracy by around 10 dB SNR. In figure 4.26 we show a confusion matrix for the classifier across all 24 classes for AWGN validation examples where SNR is greater than or equal to zero. We can see again here that the largest sources of error are between high order PSK (16/32-PSK), between high order QAM (64/128/256-QAM), as well as between AM modes (confusing with-carrier (WC) and suppressed-carrier (SC)). This is largely to be expected, as for short time observations, and under noisy observations, high order QAM and PSK modes can be extremely difficult to tell apart through any approach.
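For readers reproducing this kind of analysis, a row-normalized confusion matrix can be computed from integer class labels as in the sketch below. The function name and the toy labels are hypothetical. Rows correspond to ground truth and columns to predictions, matching the layout of the confusion matrices in this chapter.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    class-i examples that were predicted as class j."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)   # accumulate repeated pairs
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)  # avoid dividing by zero

# Toy labels for three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, pred, 3)
per_class_accuracy = np.diag(cm)
```

Off-diagonal mass concentrated in a few cells, as with the 64/128/256-QAM group above, indicates classes the model cannot separate rather than uniformly spread error.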

Classifier Training Size Requirements

When using data-centric machine learning methods, the dataset often has an enormous impact on the quality of the model learned. We consider the influence of the number of example signals in the training set, N, as well as the time-length of each individual example in number of samples, ℓ.

In figure 4.27 we show how performance of the resulting model changes based on the total number of training examples used. Here we see that dataset size has a dramatic impact on model training: high SNR classification accuracy is near random until 4-8k examples and improves 5-20% with each doubling until around 1M. These results illustrate that having sufficient training data is critical for performance. For the largest case, with 2 million examples, training on a single state of the art Nvidia V100 GPU (with approximately 125 tera floating-point operations per second (TFLOPS)) takes around 16 hours to reach a stopping point, making significant experimentation at these dataset sizes cumbersome. We do not see significant improvement going from 1M to 2M examples, indicating a point of diminishing returns for number of examples around 1M with this configuration. With either 1M or 2M examples we obtain roughly 95% test set accuracy at high SNR. The class-confusion matrix for the best performing model with ℓ=1024 and N=1M is shown in figure 4.28 for test examples at or above 0 dB SNR. In all instances here we use the σclk = 0.0001 dataset, which yields slightly better performance than AWGN.

Figure 4.28: 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M and σclk = 0.0001

Figure 4.29: Performance vs example length in samples (ℓ)

[Plot: correct classification probability (0 to 1) vs Es/N0 [dB] (−20 to 15); curves: ℓ=16, ℓ=32, ℓ=64, ℓ=128, ℓ=256, ℓ=512, ℓ=768, ℓ=1024]

Figure 4.29 shows how the model performance varies by window size, or the number of time-samples per example used for a single classification. Here we obtain approximately a 3% accuracy improvement for each doubling of the input size (with N=240k), with significant diminishing returns once we reach ℓ = 512 or ℓ = 1024. We find that CNNs scale very well up to this 512-1024 size, but may need additional scaling strategies thereafter for larger input windows simply due to memory requirements, training time requirements, and dataset requirements.

Over the air performance

We generate 1.44M examples of the 24 modulation dataset over the air using the USRP setup described above. Using a partition of 80% training and 20% test, we can directly train a ResNet for classification. Doing so on an Nvidia V100 in around 14 hours, we obtain a 95.6% test set accuracy on the over the air dataset, where all examples are roughly 10 dB SNR. A confusion matrix for this OTA test set performance based on direct training is shown in figure 4.30.
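The 80%/20% partition described above amounts to 1,152,000 training and 288,000 test examples. A minimal sketch of such a split follows. The random shuffling, the seed, and the function name are assumptions, as the exact partitioning procedure is not described.

```python
import numpy as np

def split_indices(n_examples, train_fraction=0.8, seed=0):
    """Shuffle example indices and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    n_train = int(round(n_examples * train_fraction))
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_indices(1_440_000)
print(len(train_idx), len(test_idx))
```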

Figure 4.30: 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB

Figure 4.31: Resnet transfer learning OTA performance

[Plot: correct classification probability on the test set (0.6 to 0.9) vs transfer learning epochs (0 to 50); curves: AWGN, σclk=0.0001, σclk=0.01, τ = 0.5, τ = 1.0]

Transfer Learning to Over-the-air Performance

We also consider over the air signal classification as a transfer learning problem, where the model is trained on synthetic data and then only evaluated and/or fine-tuned on OTA data. Because full model training can take hours on a high end GPU and typically requires a large dataset to be effective, transfer learning is a convenient alternative for leveraging existing models and updating them on smaller computational platforms and target datasets. We consider transfer learning where we freeze network parameter weights for all layers except the last several fully connected layers (last three layers from table 4.7) in our network while updating. This is commonly done today with computer vision models, where it is common to start by using pre-trained VGG or other model weights for ImageNet or similar datasets and perform transfer learning using another dataset or set of classes. In this case, many low-level features work well for different classes or datasets, and do not need to change during fine tuning. In our case, we consider several cases where we start with models trained on simulated wireless impairment models using residual networks and then evaluate them on OTA examples. The accuracies of our initial models (trained with N=1M) on synthetic data are shown in figure 4.21, and these ranged from 84% to 96% on the hard 24-class dataset. Evaluating performance of these models on OTA data, without any model updates, we obtain classification accuracies between 64% and 80%. By fine-tuning the last two layers of these models on the OTA data using transfer learning, we can recover approximately 10% of additional accuracy. The validation accuracies are shown for this process in figure 4.31. These ResNet update epochs on dense layers for 120k examples take roughly 60 seconds on a Titan X card to execute, instead of the full ∼500 seconds per epoch on a V100 card when updating all model weights.
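The freeze-and-fine-tune procedure can be illustrated with a small self-contained NumPy sketch. Everything in it is a stand-in: a random projection plays the role of the pre-trained (frozen) feature layers, a single softmax layer plays the role of the trainable dense layers, and the data, sizes, learning rate, and names are hypothetical. It is meant only to show that the frozen weights are never touched while the head is updated on target-domain data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: frozen feature layers plus a trainable head.
W_frozen = rng.normal(size=(8, 16))        # hypothetical pre-trained weights
W_head = 0.1 * rng.normal(size=(16, 3))    # hypothetical final dense layer

def extract_features(x):
    """Frozen part of the model; its weights are never updated."""
    return np.tanh(x @ W_frozen)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, f, labels):
    """Cross-entropy loss of the head and its gradient with respect to W."""
    p = softmax(f @ W)
    n = len(labels)
    loss = -np.mean(np.log(p[np.arange(n), labels]))
    p[np.arange(n), labels] -= 1.0
    return loss, f.T @ p / n

# Toy target-domain data with labels that depend on the frozen features.
x = rng.normal(size=(200, 8))
f = extract_features(x)                    # computed once, since it is frozen
labels = np.argmax(f @ rng.normal(size=(16, 3)), axis=1)

frozen_before = W_frozen.copy()
initial_loss, _ = loss_and_grad(W_head, f, labels)
for _ in range(300):
    _, grad = loss_and_grad(W_head, f, labels)
    W_head -= 0.1 * grad                   # only the head is updated
final_loss, _ = loss_and_grad(W_head, f, labels)
```

Because the frozen features can be computed once and cached, each update touches only a small dense layer, which is consistent with the much shorter per-epoch times reported above for dense-layer updates.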

Ultimately, the model trained on just moderate LO offset (σclk = 0.0001) performs the best on OTA data. The model obtained 94% accuracy on synthetic data, and drops roughly 7% accuracy when evaluating on OTA data, obtaining an accuracy of 87%. The primary confusion cases prior to training seem to involve suppressed or non-suppressed carrier analog signals, as well as the high order QAM and APSK modes.

This model seems perhaps the best suited among our models to match the OTA data. Very small LO impairments are present in the data: the radios used had extremely stable oscillators (GPSDO modules providing highly stable ∼75 PPB clocks) over very short example lengths (1024 samples), and the two radios were essentially right next to each other, providing a very clean impulsive direct path while any reflections from the surrounding room were likely significantly attenuated in comparison, making for a near impulsive channel. Training on harsher impairments seemed to degrade performance on the OTA data significantly.

Figure 4.32: 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)

We suspect that as we evaluate the performance of the model under increasingly harsh real world scenarios, our transfer learning will favor synthetic models which are similarly impaired and most closely match the real wireless conditions (e.g. matching LO distributions, matching fading distributions, etc). In this way, it will be important for this class of systems to train either directly on target signal environments, or on very good impairment simulations of them under which well suited models can be derived. A possible mitigation is to include domain-matched attention mechanisms such as the radio transformer network [139] in the network architecture to improve generalization to varying wireless propagation conditions.

Figure 4.33: 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)

Modulation Recognition Learning Analysis

We have extended prior work on using deep convolutional neural networks for radio signal classification by heavily tuning deep residual networks for the same task. We have also conducted a much more thorough set of performance evaluations on how this type of classifier performs over a wide range of design parameters, channel impairment conditions, and training dataset parameters. This residual network approach achieves state of the art modulation classification performance on a difficult new signal database both synthetically and in over the air performance. Other architectures still hold significant potential: radio transformer networks, recurrent units, and other approaches all still need to be adapted to the domain, tuned, and quantitatively benchmarked against the same dataset in the future. Other works have explored these to some degree, but generally not with sufficient hyper-parameter optimization to be meaningful.

We have shown that, contrary to prior work, deep networks do provide significant performance gains for time-series radio signals where the need for such deep feature hierarchies was not apparent, and that residual networks are a highly effective way to build these structures where more traditional CNNs such as VGG struggle to achieve the same performance or make effective use of deep networks. We have also shown that simulated channel effects, especially moderate LO impairments, improve the effect of transfer learning to OTA signal evaluation performance, a topic which will require significant future investigation to optimize the synthetic impairment distributions used for training.

ADL methods continue to show enormous promise in improving radio signal identification sensitivity and accuracy, especially for short-time observations. We have shown deep networks to be increasingly effective when leveraging deep residual architectures and have shown that synthetically trained deep networks can be effectively transferred to over the air datasets with (in our case) a loss of around 7% accuracy, or directly trained effectively on OTA data if enough training data is available. While large well labeled datasets can often be difficult to obtain for such tasks today, and channel models can be difficult to match to real-world deployment conditions, we have quantified the real need to do so when training such systems and helped quantify the performance impact of doing so.

We still have much to learn about how to best curate datasets and training regimes for this class of systems. However, we have demonstrated in this work that our approach provides roughly the same performance on high SNR OTA datasets as it does on the equivalent synthetic datasets, a major step towards real world use. We have demonstrated that transfer learning can be effective, but have not yet been able to achieve equivalent performance to direct training on very large datasets by using transfer learning. As simulation methods become better, and our ability to match synthetic datasets to real world data distributions improves, this gap will close and transfer learning will become an increasingly important tool when real data capture and labeling is difficult. The performance trades shown in this work help shed light on these key parameters in data generation and training, hopefully helping increase understanding and focus future efforts on the optimization of such systems.

4.3 Learning to Identify Radio Protocols

The results in the previous section focused principally on sensing of modulation type, but the same fundamental approach is valid for labeling many different properties of digital communications waveforms at the PHY or MAC layer. As shown by Sainath et al. in [36], features in a time series waveform which construct hierarchical time series structure among short-time features (such as voice utterances) can be learned in an end-to-end fashion with a higher level sequence model for effective sequence classification on noisy time series data. This has proved incredibly effective in voice recognition, and the approach can also be leveraged for higher level radio protocol identification on top of the basic modulation features [158].

Protocol identification serves an important role in network quality of service (QoS) management, intrusion detection, and anomaly detection. Today, many such systems rely on brittle parsing routines which are highly specialized to a specific set of protocols, can become useless (or worse, cause faults or vulnerabilities [159]) when protocol fields change or are malformed, and can be extremely expensive and time consuming to keep up to date or to extend with new protocol modes. As an alternative, we consider a data-based approach in which high level protocol labeling can be conducted directly on a physical layer modulated signal through end-to-end learning of the low level modulation features, with a high level classification loss guided by curated protocol labels and examples.

Figure 4.34: Transfer function of the LSTM unit, from [16]

Table 4.8: Protocol traffic classes considered for classification

Traffic Type Traffic Class Streaming Video (ABC Video) Streaming Video (YouTube) Streaming Music (Spotify) Utilities Apt-get Utilities ICMP Ping Utilities Git Version Control Utilities IRC Chat Browsing Bit-Torrent Browsing Web-Browsing Browsing FTP Transfer Browsing HTTP Download

Several powerful recurrent network structures exist, such as the long short-term memory (LSTM) [160, 161], the gated recurrent unit (GRU) [162], and more recently the computationally efficient quasi-recurrent neural network (QRNN) [163]. For our work we leverage the LSTM in both an RNN-DNN architecture and a CNN-RNN-DNN (CLDNN) architecture.

We generate a set of recorded IP traffic captures using Wireshark [164] from the list of protocols in table 4.8 and re-modulate them over an un-coded QPSK with HDLC communications link to produce labeled I/Q sample files for classification.

Table 4.9: Recurrent network architecture used for network traffic classification

Layer              Output dimensions
Input              N × (2 × 128)
LSTM               256
LSTM               256
LSTM               256
Dense + ReLU       64
Dense + softmax    11

Table 4.10: Performance measurements for RNN protocol classification for varying sequence lengths

Sequence Length   Val. Loss   Val. Accuracy   Nsamples   Nsymbols   Nbits   Sec/Epoch
32                1.2126      0.498805        1120       140        280     5
64                1.0386      0.553546        2144       268        536     18
128               0.7179      0.65894         4192       524        1048    17
256               0.4586      0.75621         8288       1036       2072    29
512               0.2711      0.836535        16480      2060       4120    38
768               0.5328      0.730413        24672      3084       6168    27

The recurrent neural network (RNN) architecture evaluated (which in this case had the best performance on clean signal data) is shown in table 4.9. No network tuning was used; this was the same network structure commonly used for character level RNNs (char-rnn [165]).
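To give a rough sense of the size of the network in table 4.9, the following sketch counts trainable weights layer by layer. It assumes a standard LSTM cell (four gates, each with input weights, recurrent weights, and a bias, no peephole connections) and assumes that each 2 × 128 window is flattened to 256 input features per time-step. Neither detail is stated in the text, so the totals are indicative only, and the helper names are hypothetical.

```python
def lstm_params(n_in, n_hidden):
    """Weights in a standard LSTM layer (4 gates, no peephole connections)."""
    return 4 * (n_in + n_hidden + 1) * n_hidden


def dense_params(n_in, n_out):
    """Weights plus biases in a fully-connected layer."""
    return n_in * n_out + n_out


# Assumption: each time-step's 2 x 128 I/Q window is flattened to 256 features.
FEATURES_PER_STEP = 2 * 128

layer_params = [
    ("LSTM 256", lstm_params(FEATURES_PER_STEP, 256)),
    ("LSTM 256", lstm_params(256, 256)),
    ("LSTM 256", lstm_params(256, 256)),
    ("Dense + ReLU 64", dense_params(256, 64)),
    ("Dense + softmax 11", dense_params(64, 11)),
]
total_params = sum(count for _, count in layer_params)

for name, count in layer_params:
    print(f"{name:20s} {count:>9,d}")
print(f"{'total':20s} {total_params:>9,d}")
```

Under these assumptions the model has roughly 1.6 million weights, and the count does not depend on the sequence length N because the LSTM weights are shared across time-steps.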

We evaluate a range of different input sequence lengths N of the LSTM, comparing the average number of input samples/bits required to obtain a good estimate of each modulated protocol traffic type. Table 4.10 tabulates the resulting network performance for training and evaluating classification performance with differing sequence lengths of 128 complex sample windows. The title of Andrej Karpathy's excellent article [165] is apt here, as the 'unreasonable effectiveness' of LSTMs extends to quite effectively identifying high level traffic protocol behaviors with only access to raw modulated I/Q data. In this case, we obtain best performance with a sequence of 512 windows, with a validation set accuracy of around 83.6%.
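The relationship between the sequence length N and the sample, symbol, and bit counts in table 4.10 can be sketched as follows. The hop of 32 samples between consecutive 128-sample windows and the 8 samples per symbol are not stated in the text; they are inferred here because they reproduce every row of the table, and should be treated as assumptions. The function names are hypothetical.

```python
WINDOW = 128             # complex samples per LSTM time-step (from the text)
STRIDE = 32              # assumed hop between windows, inferred from table 4.10
SAMPLES_PER_SYMBOL = 8   # assumed, inferred from table 4.10
BITS_PER_SYMBOL = 2      # QPSK carries two bits per symbol


def observation_size(n_windows):
    """Return (samples, symbols, bits) spanned by n_windows overlapping windows."""
    n_samples = (n_windows - 1) * STRIDE + WINDOW
    n_symbols = n_samples // SAMPLES_PER_SYMBOL
    return n_samples, n_symbols, n_symbols * BITS_PER_SYMBOL


def sliding_windows(samples, n_windows):
    """Cut complex samples into n_windows rows, each holding an I list and a Q list."""
    rows = []
    for k in range(n_windows):
        chunk = samples[k * STRIDE:k * STRIDE + WINDOW]
        rows.append([[s.real for s in chunk], [s.imag for s in chunk]])
    return rows


for n in (32, 64, 128, 256, 512, 768):
    print(n, observation_size(n))
```

Each row produced by `sliding_windows` has shape 2 × 128, so a full input example has the N × (2 × 128) shape listed in table 4.9.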

Figure 4.35: Best LSTM256 confusion with RNN length of 512 time-steps

The confusion matrix for the resulting classifier performance with sequence length of N = 512 (16,480 samples) is shown in figure 4.35. Since the observation window is only 16 ms of traffic observation, some error is to be expected, as not all observation windows will contain distinctive traffic patterns and all classes may have some amount of common background traffic (domain name system (DNS), address resolution protocol (ARP), etc).

These results indicate some initial promise of deep learning based protocol analysis even down to the raw physical layer, but significant investment and work in larger scale robust dataset development needs to occur to significantly advance the field. Our efforts to perform similar classification on impaired RF channels (including noise, fading, offsets, etc) were less successful with a straightforward RNN approach. We believe this avenue can certainly be fruitful (likely using a CLDNN style architecture), but newer tools for architecture optimization, hyper-parameter tuning, domain specific dataset augmentation, and generally larger datasets will be required to accomplish this task.

4.4 Learning to Detect Signals

Figure 4.36: Detection Algorithm Trade-space Sensitivity vs Specialization

Radio signal detection is a key task in spectrum diagnostic and monitoring systems as well as cognitive radios such as those performing DSA. Today, systems which do detection typically have to make a difficult design choice: specialize detection algorithms heavily for features of a specific signal type or class of signals, or rely on highly generalizable energy based detection methods with lower sensitivity. This is an unfortunate design trade-off, as it forces designers to either forego generality or performance during design, or to select between them dynamically at run-time using additional complex logic and estimation [166]. The generality of feature based detectors varies; for instance, cyclo-stationary or moment based detectors may have more generality than highly specialized features such as matched filters or cross ambiguity function (CAF) plane searches, but they are still highly specific to a narrow class of modulation types or properties. This is problematic, especially as learned communications systems drastically increase the range of signal types possible. Figure 4.36 illustrates this trade-space at a high level, showing how objective based learned feature detectors fill a much desired void by obtaining both. This ideal class of detectors, which achieves both high sensitivity and wide generality, can be obtained through data centric machine learning approaches relying on feature learning, where, given sufficient data, highly sensitive features are learned for many different signal types using the same basic approach without the need for hand tuning or manual feature engineering.

There are many pre-processing signal representation domains in which detection strategies can be applied: raw time domain, frequency domain, wavelet domain, combinations of these, or others. As the most straightforward approach with analogues to existing work in computer vision, we consider the 2D time-frequency spectrogram plane for our work and leverage image object detection techniques which have already reached maturity, surpassing human levels of performance in many cases [156, 167]. The intuition for this approach is strong, as skilled domain engineers can regularly perform manual observation on spectrogram images and identify and localize signals highly accurately with their eyes, illustrating the sufficient availability of information given the right interpretation.

This approach to object detection has recently come to the forefront in medical imaging and other non-visual domains, providing computer assisted diagnosis in radiology and other fields which in many cases outperforms panels of skilled radiologists in identifying cancer [168], fractures, Alzheimer’s disease [169], and others.

Figure 4.37: Computer Vision CNN-based Object Detection Trade Space, from [17]

We consider the application of several leading computer vision object detection approaches to the task of radio signal detection in [18].

Each of the leading techniques in recent years has relied on CNNs for learned features on the front end, while numerous strategies exist for architectures, targets, loss functions, iteration, and training. Each of these relies on large training sets containing annotations with bounding boxes to indicate and localize ground truth of various object classes in the image. Networks then typically learn to predict bounding boxes, class labels, and confidence metrics through some means for which there are several strategies. Initial promising solutions to the problem relied on region proposal networks such as region-based convolutional neural network (R-CNN) [170], Fast R-CNN [40], and newer versions of this technique which rely on conducting multiple network forward passes for each object or region proposal in an image iteratively to refine the region prediction. This works well, but is quite expensive computationally and consequently slow when considering the throughput of many classifications on finite computing resources. In radio detection, we often seek to perform detection at extremely high rates and low latencies for many wide-band spectrum sensing tasks, where speed is key. The you only look once (YOLO) approach [171] solved this by proposing a single feed forward pass network which jointly produces object bound and class proposals for a grid of regions within the image simultaneously. This approach of bounding box and class prediction within a single network forward pass was improved upon by SSD [172] and then improved further in [17]. Among the improvements are network architectures, as well as the use of anchor boxes, and improved loss functions for regression which led to numerous improvements.

Figure 4.38: Example bounding box detections in computer vision, from [17]

We use a network architecture for this work which is a variant of YOLO (known as tiny-YOLO), as described in table 4.11. Note that this network is much smaller than the full-size one used in [17]. Compared to visual object recognition tasks, recognition of spectral events is a relatively simpler task in many cases, allowing for smaller networks to be used. Additionally, a smaller network helps to reduce over-fitting on the currently available smaller datasets for the task, and reduces the computational complexity of forward passes, resulting in lower power and faster operation.

Table 4.11: Table input/output shapes

Layer Number    Layer Type      Kernel Size    Number of Feature Maps
1,2,3,4,5,6     Conv+Maxpool    (3,3)          16,32,64,128,256,512
7,8             Conv            (3,3)          1024,1024
9               Conv            (1,1)          30

We train our system using the same approach as presented in the YOLO method, but we can make a handful of simplifications for detection. We consider an S × S grid of detections, predicting B bounding boxes for each cell along with a set of C class probabilities as in [171]. We consider the YOLO loss function given below in equation 4.5, where the indicator 1obj is evaluated only when the cell contains an object, and 1no−obj is evaluated only when the cell does not contain an object. We do not use anchor boxes or intersection over union (IOU) loss for this initial work, performing direct regression of w and h instead, leaving this for future work which we believe will almost certainly yield further improvements.

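$$
\begin{aligned}
L_{YOLO} ={}& \lambda_c \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, D_{L2}\big((x_i, y_i), (\hat{x}_i, \hat{y}_i)\big) \\
&+ \lambda_c \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, D_{L2}\big((w_i, h_i), (\hat{w}_i, \hat{h}_i)\big) \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, \big(C_i - \hat{C}_i\big)^2 \\
&+ \lambda_{no-obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{no-obj} \, \big(C_i - \hat{C}_i\big)^2 \\
&+ \sum_{i=0}^{S^2} \sum_{c \in classes} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
\tag{4.5}
$$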

Figure 4.39: YOLO style per-grid-cell bounding box regression targets

Here, the first two terms of the loss minimize the L2 distance of the bounding box location (x/y) and size (w/h) when an object is present (as shown in figure 4.39), terms three and four minimize error in the confidence metric C for cells with and without an object respectively, and the final term minimizes error in the class prediction probabilities p(c). In our case, if we seek to perform object detection on a single class, RF emissions, we can drop the class probability term and only perform bounding box regression and confidence estimation for a single object class, simplifying the task and network complexity significantly.
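As a minimal sketch of the single-class case, the loss can be reduced to bounding box regression plus confidence estimation and evaluated cell by cell. Several details here are assumptions rather than the author's implementation: one predicted box per cell (B = 1), a confidence target of 1 for cells containing an object and 0 otherwise, an unsquared Euclidean distance for the box terms, and the weights 5 and 0.5, which are the defaults from the original YOLO paper [171] and are not confirmed for this work. The function name and data layout are hypothetical.

```python
import math


def detection_loss(targets, preds, lambda_c=5.0, lambda_noobj=0.5):
    """Single-class, one-box-per-cell reduction of the loss in equation 4.5.

    targets: one entry per grid cell, None for an empty cell or (x, y, w, h).
    preds:   one entry per grid cell, (x, y, w, h, confidence).
    """
    loss = 0.0
    for target, pred in zip(targets, preds):
        px, py, pw, ph, conf = pred
        if target is None:
            # Empty cell: only push the predicted confidence towards 0.
            loss += lambda_noobj * (0.0 - conf) ** 2
            continue
        tx, ty, tw, th = target
        loss += lambda_c * math.hypot(tx - px, ty - py)  # box location term
        loss += lambda_c * math.hypot(tw - pw, th - ph)  # box size term
        loss += (1.0 - conf) ** 2                        # confidence term
    return loss


# A 2 x 2 grid flattened to four cells, with one emission in the second cell.
targets = [None, (0.5, 0.5, 0.2, 0.1), None, None]
preds = [
    (0.0, 0.0, 0.0, 0.0, 0.1),
    (0.5, 0.5, 0.2, 0.1, 0.9),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.2),
]
print(detection_loss(targets, preds))
```

Box predictions in empty cells contribute nothing to the loss, which mirrors the role of the object indicator in equation 4.5.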

Figure 4.40: Radio bounding box detection examples, from [18]

In figure 4.40 we illustrate a synthetic wide-band bounding-box annotated radio dataset generated for the DARPA Battle-of-the-ModRecs competition using a set of our custom wide-band signal generation tools in GNU Radio [173]. Here we show ground truth bounding boxes alongside predicted bounding boxes produced by our trained tiny-YOLO detector on a validation portion of the dataset. In this case, we obtain excellent performance in predicting good bounding box annotations and maintain resilience to wideband noise emissions across the band which appear as energy to our detector.

We also illustrate the performance of the model as tested on an over the air wide-band spectrogram using tools being developed by DeepSig Inc. In figure 4.41, we show the received radio spectrogram for an ISM band, with a series of rapid bursty radio emissions occurring throughout. This spectrogram has been labeled with annotations using a similar YOLO style network with bounding box regression and confidence prediction, where we have thresholded and removed all low confidence boxes, which are not shown. Here we can see that a number of traditionally difficult tasks such as discerning overlapping bursts, adjacent bursts, and heavily faded bursts are all handled appropriately.

Figure 4.41: Over the air wideband signal bounding box prediction example

This is a key result for the learned detector approach: through a generic process of human bounding box guidance we are able to rapidly train a detector to perform as desired for an unknown signal type without significant investment in additional specialized detection algorithms. This technique is especially powerful as the detector, operating over a receptive field, is much more resilient to small impairments, occlusions (interference), or other distortions in the signals which might have readily caused a simple energy based detector to mis-detect or poorly bound a radio signal emission. Work remains to be done to quantify the performance of the detector in a classical constant false alarm rate or receiver operating characteristic (ROC) curve style sensitivity analysis against the classical binned energy detector, but based on comparable results in computer vision and human visual capabilities when performing this task manually, we believe such a study in future work will yield excellent results.

Chapter 5

Learning Radio Structure

Much of the work discussed to this point has been focused on either learning new physi- cal layer communications systems or learning in a supervised way how to detect, classify and label radio emissions. This chapter takes a step back and looks at how unlabeled radio signal data (which describes most available data in the world, and the data hitting our sensors) can be used in order to learn structure of radio signals, enable compression of radio signals, and to partition and learn to separate types of radio signals without train- ing or through a semi-supervised approach. It also takes a deeper dive into the question of how to select network architectures and hyper-parameters for training various tasks through approaching it as a guided model search problem, a key enabler for radio algo- rithm discovery and optimization.

150 Timothy J. O’Shea Chapter 5. Learning Radio Structure 151

5.1 Unsupervised Structure Learning

Widely used single-carrier radio signal time series modulation schemes today use a relatively simple set of supporting basis functions to modulate information into the radio spectrum. Digital modulations typically use sine wave basis functions with pseudo-orthogonal properties in phase, amplitude, or frequency. Information bits are used to map a symbol value si to a location in this space φj, φk, .... In figure 5.1 we show three common basis functions where φ0 and φ1 form phase-orthogonal bases used in PSK and QAM, while φ0 and φ2 show frequency-orthogonal bases used in frequency shift keying (FSK). In the final panel of figure 5.1 we show a common mapping of constellation points into this space used in Quadrature Phase Shift Keying (QPSK) to encode two bits of information per symbol.
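The mapping of bits onto phase-orthogonal bases can be illustrated with a short sketch. It assumes φ0 is a cosine and φ1 a sine over one symbol period, a Gray-style bit mapping, rectangular (unshaped) symbols, and 8 samples per symbol; none of these choices are specified in the text and the function names are hypothetical.

```python
import math


def basis(n_samples, quadrature=False, cycles=1):
    """One symbol period of a cosine (in-phase) or sine (quadrature) carrier."""
    func = math.sin if quadrature else math.cos
    return [func(2 * math.pi * cycles * n / n_samples) for n in range(n_samples)]


def inner(a, b):
    """Discrete inner product, used to check orthogonality of two bases."""
    return sum(x * y for x, y in zip(a, b))


def qpsk_point(b0, b1):
    """Map two bits to unit-energy coordinates on (phi0, phi1)."""
    scale = 1 / math.sqrt(2)
    return ((1 - 2 * b0) * scale, (1 - 2 * b1) * scale)


def modulate(bits, n_samples=8):
    """Weight the two bases by each symbol's coordinates and concatenate."""
    phi0, phi1 = basis(n_samples), basis(n_samples, quadrature=True)
    out = []
    for b0, b1 in zip(bits[0::2], bits[1::2]):
        s0, s1 = qpsk_point(b0, b1)
        out.extend(s0 * p + s1 * q for p, q in zip(phi0, phi1))
    return out


phi0, phi1 = basis(64), basis(64, quadrature=True)
print("phi0 . phi1 =", round(inner(phi0, phi1), 12))
print("samples for bits 0,1,1,0:", len(modulate([0, 1, 1, 0])))
```

Because the inner product of the two bases vanishes over a whole number of cycles, the two coordinates of each symbol can be recovered independently by correlating the received samples against each basis.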

Digital modulation theory in communications is a rich subject explored in much greater depth in numerous great texts such as [174].

Figure 5.1: Example Radio Communications Basis Functions

We seek to learn a sparse representation using learned convolutional basis functions which maximally compresses radio signals of interest, obtaining the most sparse representation possible. Given there is random data modulated onto the radio signal and CSI stored about its arrival mode, there is certainly some information theoretic limit to how far the signal can be compressed while still allowing the same information to be reconstructed. We can lower bound this by the entropy of the data bits, but likely need to also consider the entropy of the encoded CSI.

Figure 5.2: Convolutional Autoencoder Architecture for Signal Compression

We set up a minimal convolutional autoencoder as shown in figure 5.2 where an input complex time domain radio signal is decomposed into a small set of convolutional filters, compressed to a small number of activations through a fully-connected layer, then decompressed and reconstructed through a similar fully-connected and convolutional regression layer. In this case, we use linear activations on the convolutional layers, and non-linear activations only on the fully-connected compression layers.

Figure 5.3: Convolutional Autoencoder reconstruction of QPSK example 1

Inspecting a QPSK signal compressed in this way in figure 5.3, we see that the complex continuous valued 88 sample input signal can be quite cleanly reconstructed at the output while passing through an intermediate layer of 44 intermediate values which saturated at 0 or 1. Interestingly, while representing only the structural portions of the signal in the basis functions, significant amounts of high frequency noise which does not lie on the basis function naturally has been removed in the reconstruction.

Another example is shown in figure 5.4 where relatively clean reconstruction is achieved in the same way. Considering the compression occurring here, we have 88*2=176 float32 values for each input example, consisting of a total of 5632 bits, while we have a saturated sparse representation of approximately 44 bits. This is a compression factor of approximately 128x.

Figure 5.4: Convolutional Autoencoder reconstruction of QPSK example 2

Figure 5.5: AE Encoder Filter Weights

Figure 5.6: AE Decoder Filter Weights

If we instead consider the input signal to be dynamic range limited to approximately 20 dB SNR (assuming optimal representation scaling), we can assume the signal can be represented in 4-bit precision without quantization error reducing SNR (e.g. 6.02 dB × 4 bits = 24.08 dB > 20 dB). We can then take the input signal to be 704 bits of information compressed down to 44 bits, still a compression factor of 16x. This is relatively encouraging for a scheme which is perhaps the simplest convolutional autoencoder which could be employed for such a task, with no tuning.
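The two compression factors quoted above follow from simple bit counting, which can be reproduced as a short worked example. The constants are taken from the text (88 complex samples, 44 saturated activations, roughly 6.02 dB of dynamic range per bit); the function name is hypothetical.

```python
N_SAMPLES = 88      # complex samples per example (from the text)
LATENT_BITS = 44    # saturated activations in the bottleneck (from the text)
DB_PER_BIT = 6.02   # approximate dynamic range gained per bit of precision


def compression(bits_per_value):
    """Return (input_bits, compression_factor) for I and Q at a given precision."""
    input_bits = N_SAMPLES * 2 * bits_per_value
    return input_bits, input_bits / LATENT_BITS


print("float32:", compression(32))
print("4-bit:  ", compression(4))
print("4-bit dynamic range (dB):", DB_PER_BIT * 4)
```

The first figure treats every float32 bit as information, while the second counts only the precision needed to represent a signal limited to about 20 dB SNR, which is why it is the more conservative estimate.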

Interestingly, if we inspect the filter weights learned in the convolutional encoding layer and convolutional decoding layer in figures 5.5 and 5.6, we can see that the basis functions for PSK modulation at the given relative symbol rate with RRC pulse shaping are learned directly in the filter weights. This raises an interesting possibility for discovering the basis functions for any new unknown modulation type simply based on learning a similar sparse representation thereof. It also raises the question of whether some Galois field GF(2) logic function exists to map the sparse representation bits into the transmitted data bits. If that is the case, through compression we would have just naively learned a demodulator for any new random modulation type solely through reconstruction loss.

Finally, the implications for denoising the input signal visible in figures 5.3 and 5.4 are quite interesting. Through projection onto basis functions and reconstruction therefrom, such an approach might offer a lower complexity alternative to the full demodulation, re-modulation and subtraction currently used in successive interference cancelation (SIC), offering the possibility for a computationally cheaper version of this technique.

5.2 Unsupervised Class Discovery

Labeling of datasets can be expensive, difficult and time consuming. For this reason, as we turn increasingly to machine learning and data centric methods, it is important to develop methods which exploit unsupervised learning as much as possible to minimize the human curation requirement when unnecessary, and to maximally leverage human guidance when it is needed. In [175], we consider a collection of techniques for unsupervised and semi-supervised [176, 177, 178] identification of radio signal emission types using structure learning, sparse embedding, and clustering.

Dimensionality reduction techniques such as principal component analysis (PCA) [179] and independent component analysis (ICA) [180] have been used widely in signal processing to obtain low dimension representations, to perform compression and de-noising, and for other purposes. Non-linear versions such as kernel-PCA [181] exist which extend these methods into the non-linear representation domain; however, the choice of kernel often severely limits non-linear representation capacity, and leaves much to be desired in terms of improved non-linear models for dimensionality reduction. Autoencoders with non-linear activations as discussed in section 5.1 offer a potential for significantly improved non-linear dimensionality reduction and representation beyond what has been achievable with prior methods. Recent work, for instance in image and video compression domains [182, 183], has shown that such nonlinear autoencoder compression schemes can achieve better and more compressed low-dimensional representations of image domain examples than previously achievable with other techniques.

We consider both supervised and unsupervised methods for learning sparse representations or embeddings of RF signal examples in figures 5.7 and 5.8. These are both compressive non-linear representations, but they have different objectives. In the case of the supervised method, discriminative features are learned which help impart human guidance on the objective class separation. In the case of the unsupervised method, reconstructive features are learned which simply try to best reconstruct each example through the non-linear compressed representations of supporting learned convolutional basis functions which minimize reconstruction loss (e.g. MSE).

Figure 5.7: Supervised Embedding Approach

Figure 5.8: Unsupervised Embedding Approach

Each of these embeddings offers its own advantages. In the purely unsupervised case, the absence of any labeling work is of course appealing, as large amounts of unlabeled radio data are readily available. In the case of supervised learning, the features and representations are already guided towards signal type discrimination, but in some cases may not generalize well to separation of new modulation types.

In figures 5.9 and 5.10 we illustrate the resulting clustering of 11 radio signal modulation classes using these two embedding approaches. Embeddings are further reduced from ∼40 dimensions down to 2 for visualization using t-SNE [113]. For the supervised features we use the embedding of the final layer of a VGG-style CNN, prior to the final fully-connected SoftMax output layer, and for the unsupervised feature training we use the output of a small convolutional autoencoder. We color example points with their class labels for visualization. For the supervised embedding clustering, we can see excellent separability of classes for virtually all classes, but label information was used in the creation of the feature space. For unsupervised embedding clustering, we can see some degree of separability in some of the more distinct classes (e.g. 8-PSK, AM-SSB, AM-DSB), but see significant mixing between similar modulation types which share common basis function properties (e.g. BPSK/QPSK mixing, QAM16/QAM64 mixing).

Figure 5.9: Supervised Signal Embeddings

Figure 5.10: Unsupervised Signal Embeddings

We can measure the ability of these approaches to generalize to some degree by training and clustering them using hold-out classes which are introduced after embedding space training, without labels. In doing so, we can begin to measure the quantitative accuracy with which each approach successfully detects new classes as new clusters. We also create a clustering representation in which a human curator can begin to label examples by cluster rather than by individual example. These are both important steps towards creating learning systems which scale and learn from new data and emitters over time; however, much of the quantitative analysis and optimization of this approach is left for future work.

5.3 Neural Network Model Discovery and Optimization

One of the biggest problems in the use of artificial neural networks for machine learning is the task of architecture selection and hyper-parameter optimization. Architectures can make an enormous difference in the performance of a neural network in terms of accuracy and computational cost (as recently demonstrated in [184]), by introducing appropriate classes of tied weights (e.g. convolutional layers, dilated convolutions) and by appropriately managing the degrees of freedom in a network (e.g. pooling, striding, etc) to preserve enough information at each layer while keeping the free-parameter count low enough and incorporating a domain appropriate distillation mode for information.

In section 5.3 we review a number of the published state of the art approaches in recent deep learning literature for solving this problem. Unfortunately, many of these approaches are too computationally complex for people with finite computing resources and funding (i.e. other than Google/Facebook).

As a solution we develop a model based on a simplified version of Google’s evolutionary model search approach in [8]. Here we represent a directed graph of high level neural network primitives and key hyper-parameters as a compact model description as shown in figure 5.11. We implement evolutionary routines [185] for random model generation, mutation and crossover of model graph structure and hyper-parameters, and leverage an evolutionary particle swarm optimization [186] approach to generating and breeding populations of models. In contrast to the approach in [8], which we presume is run across a large distributed cluster of computing nodes (to support population sizes of 1000), we evaluate our model on a single Nvidia Digits development server with 4 Titan X GPU cards, using substantially smaller population sizes and search lengths. We call this approach EvolNN (evolutionary neural network).

Figure 5.11: Compact Model Network Digraph and Hyper-Parameter Search Process
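The routines described above (random model generation, mutation, crossover, and breeding of populations) can be sketched in a few lines of Python. This is a minimal illustration and not the EvolNN implementation: the primitive set, the `Genome` structure, the size limits, and `toy_fitness` are all hypothetical, and plain truncation selection with elitism stands in for the evolutionary particle swarm optimization of [186]. In a real search the fitness function would train each candidate network and return its validation score.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical primitive set and size choices, for illustration only.
LAYER_TYPES = ("conv", "maxpool", "avgpool", "dropout")
SIZES = (16, 32, 64, 128, 256)
MAX_LAYERS = 8


@dataclass(frozen=True)
class Genome:
    """Compact model description: an ordered layer list plus one hyper-parameter."""
    layers: tuple
    learning_rate: float


def random_layer(rng):
    return (rng.choice(LAYER_TYPES), rng.choice(SIZES))


def random_genome(rng):
    layers = tuple(random_layer(rng) for _ in range(rng.randint(1, 5)))
    return Genome(layers, 10 ** rng.uniform(-4, -1))


def mutate(genome, rng, rate=0.3):
    """Randomly insert or delete a layer and perturb the learning rate."""
    layers = list(genome.layers)
    if rng.random() < rate and len(layers) < MAX_LAYERS:
        layers.insert(rng.randrange(len(layers) + 1), random_layer(rng))
    if rng.random() < rate and len(layers) > 1:
        layers.pop(rng.randrange(len(layers)))
    lr = genome.learning_rate
    if rng.random() < rate:
        lr = min(1e-1, max(1e-4, lr * 10 ** rng.uniform(-0.5, 0.5)))
    return Genome(tuple(layers), lr)


def crossover(a, b, rng):
    """Splice a prefix of one parent's layers onto a suffix of the other's."""
    head = a.layers[: rng.randint(0, len(a.layers))]
    tail = b.layers[rng.randint(0, len(b.layers)):]
    layers = (head + tail)[:MAX_LAYERS] or a.layers[:1]
    return Genome(layers, rng.choice((a.learning_rate, b.learning_rate)))


def toy_fitness(genome):
    """Placeholder score; a real search would train the model and
    return its validation accuracy instead."""
    n_conv = sum(1 for kind, _ in genome.layers if kind == "conv")
    return -abs(n_conv - 3) - abs(math.log10(genome.learning_rate) + 3)


def evolve(fitness, population_size=32, generations=4, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: max(2, population_size // 4)]  # truncation selection
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children  # parents survive (elitism)
    return max(population, key=fitness), history


best, history = evolve(toy_fitness)
```

Because the best parents are carried over unchanged, the best score in this sketch can never decrease from one generation to the next; the cost of each generation is dominated by however expensive the fitness evaluation is.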

Evaluating the model on several benchmark test sets, evolutionary model search finds solutions which score quite well on standard benchmarks like MNIST fairly readily (figure 5.13), while we can also apply the search problem to very difficult datasets such as the hard 24-modulation dataset from section 4.2.3, shown in figure 5.12. In this case, the tasks of image and modulation classification are completely different domains, but the same evolutionary approach is able to find reasonable solutions to both very quickly.

Figure 5.12: EvolNN ModRec Net Search Accuracy

Figure 5.13: EvolNN MNIST Net Search Accuracy

Table 5.1: Final small MNIST search CNN network

Layer         Output dimensions
Input         28 × 28 × 1
Conv          24 × 24 × 104
Dropout       24 × 24 × 104
Flatten       59904
FC/SoftMax    10

Table 5.2: Final Modrec search CNN network

Layer         Output dimensions
Input         1024 × 2
Conv          335 × 11
Conv          323 × 256
AvgPool       107 × 256
MaxPool       21 × 256
MaxPool       5 × 256
Flatten       1280
FC/SoftMax    24
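As a quick consistency check on the layer dimensions in tables 5.1 and 5.2, the short sketch below recomputes each Flatten size from the layer that precedes it. The helper names are hypothetical. The 5 × 5 kernel used in the last line is an inference from the 28 to 24 reduction in table 5.1 (a stride-1 convolution without padding), not a value stated in the text.

```python
from math import prod

# Layer names and output dimensions as listed in tables 5.1 and 5.2.
mnist_layers = [
    ("Input", (28, 28, 1)),
    ("Conv", (24, 24, 104)),
    ("Dropout", (24, 24, 104)),
    ("Flatten", (59904,)),
    ("FC/SoftMax", (10,)),
]
modrec_layers = [
    ("Input", (1024, 2)),
    ("Conv", (335, 11)),
    ("Conv", (323, 256)),
    ("AvgPool", (107, 256)),
    ("MaxPool", (21, 256)),
    ("MaxPool", (5, 256)),
    ("Flatten", (1280,)),
    ("FC/SoftMax", (24,)),
]


def flatten_sizes(layers):
    """Return (size implied by the layer before Flatten, size listed for Flatten)."""
    names = [name for name, _ in layers]
    i = names.index("Flatten")
    return prod(layers[i - 1][1]), layers[i][1][0]


def valid_conv_output(length, kernel, stride=1):
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (length - kernel) // stride + 1


mnist_check = flatten_sizes(mnist_layers)    # 24 * 24 * 104 vs. 59904
modrec_check = flatten_sizes(modrec_layers)  # 5 * 256 vs. 1280
assumed_mnist_conv = valid_conv_output(28, kernel=5)  # assumed 5 x 5 kernel
```

Both tables are internally consistent: the element count of the last feature map matches the listed Flatten size in each case.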

Both of these tasks are configured by providing a reference dataset with input and output shapes, a loss function for classification (categorical cross-entropy, CCE), and an evolutionary model configuration including population size, generations, cross-over rate, mutation rate, etc. For the search accuracy trajectories shown above, we ultimately obtain the best models given in tables 5.1 and 5.2.

For the MNIST model, we find the solution in only 4 generations of population size 32, with the best model achieving an accuracy of 99.22% on the validation set. For the modulation recognition task, a significantly more difficult task, we obtain a slightly larger network, which learns to narrow the information representation gradually using several convolutional layers and pooling layers. In this case the best performance is only 42% validation set accuracy, compared to the ∼76% achieved through expert design, but we observe a steady, slow growth in performance throughout the evolution process, and believe that with additional search tuning and longer search times much better models could be found through this approach.

Figure 5.14: EvolNN CFO estimation network search loss

By simply changing the objective loss function of the evolutionary process (in this case to MSE) we can use the same infrastructure to search for optimal regression networks. In this case, the CFO estimation network we previously showed in table 4.1 is the best model found for a model search on our CFO estimation dataset and task. Figure 5.14 shows the evolutionary model loss over a number of generations, where we can see the estimator MSE converging to smaller values throughout the search process and ultimately arriving at a best MSE of 0.0011. Here we search for 32 generations, each with a population size of 32.
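The idea of swapping the objective while reusing the search machinery can be illustrated with a small sketch. The function names here are hypothetical and this is not the EvolNN code; it only shows that a classification loss (CCE) and a regression loss (MSE) can sit behind the same fitness interface, so that the evolutionary loop itself does not need to change.

```python
import math


def categorical_cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each example."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def mean_squared_error(predictions, targets):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def make_fitness(loss_fn):
    """Wrap a loss so that a lower loss gives a higher fitness for the search."""
    def fitness(outputs, references):
        return -loss_fn(outputs, references)
    return fitness


# The same search loop can be handed either objective.
classification_fitness = make_fitness(categorical_cross_entropy)
regression_fitness = make_fitness(mean_squared_error)
```

Negating the loss keeps the convention that the search maximizes fitness, whichever objective is plugged in.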

Neuro-evolution [29] is a very powerful tool, and holds significant biological grounding in living creatures. We leverage this very high level intuition for both evolutionary model selection and loss feedback based model optimization; both are very rough approximations of how we believe biological learning and evolution occur. Both of these processes seem to have a very long way to go before any notion of optimality is reached, but initial results are promising, providing reasonably good results and generality on new tasks such as estimator synthesis, for which no literature exists on best practices for manually crafting and optimizing model architectures. The models shown here are still trivially small compared with full size state of the art architectures used today, but we expect that sufficient computational resources and evolutionary tuning will close this gap. As computing and data become increasingly cheap, such guided search approaches to model and architecture selection are increasingly appealing when compared with lengthy, expensive and less effective manual architecture engineering and tuning cycles.

Chapter 6

Conclusion

Machine learning and parallel computing have provided a set of incredibly powerful tools over recent years which have opened up orders of magnitude improvement in our ability to optimize very large scale, high degree of freedom problems through direct gradient descent on well formed loss functions. While these tools are being readily applied in the computer vision and NLP spaces today, the full extent of their engineering impact will not be realized in applications and in industry for many years to come. These tools represent a major shift in algorithm design away from simplified model based solutions to problems and specialized software routines, towards data-centric model optimization using highly general parametric models capable of learning very high-dimensional solutions to many difficult tasks through end-to-end learning.

This enormous shift in design methodology does not mean we can’t perform quantitative analysis, measurement, and probabilistic characterization of the performance of such models, but it does make predicting or guaranteeing performance somewhat more difficult in many cases, principally because they are derived from dataset distributions which in themselves are not well characterized and are formed from complex real world distributions. Many of the probabilistic tools for guaranteeing, explaining and optimizing performance are catching up quickly, but since such models now rely heavily on high-dimensional datasets directly for learning, performance guarantees will necessarily become a much more complex function of the dataset distribution rather than of a compact simplified model.

This same shift was extremely contentious in the computer vision domain before widespread adoption, and the same resistance is being felt in many other fields including radio signal processing. There is considerable hostility towards learning directly from rich distributions of large datasets rather than assuming the conventional compact models which have been used for years in the wireless space. As stated by an anonymous [highly negative] reviewer for a conference paper this past year, “Radio spectra are not mere images of cats but are issued from well acknowledged and fairly accurate wireless communication models”; many people are not pleased with attempts to rely on data instead of solely on these models. In reality many of the models used are insufficient, and there is much still to gain by leveraging the best of both worlds.

Ultimately we are at a crossroads in wireless and signal processing, where practitioners of both analytic compact model construction and large approximate model construction must adopt good practices for data science, such as adopting benchmark tasks and datasets which truly reflect useful target tasks in the real world. We have attempted to help address this issue by open sourcing and publishing several datasets throughout this work, on which numerous classes of approach can be fairly compared in a quantitative fashion. We truly hope that more people will adopt this approach to algorithm development and optimization, making benchmarks open and quantitatively tracking approach scores in a way similar to ImageNet [1], CIFAR [107], Kaggle challenges [187], or other well characterized and scored tasks.

Funding agencies such as DARPA, NIST, NSF, or industry can significantly help this process by explicitly funding, promoting and publishing high quality datasets and data to accompany desired tasks, which is no trivial undertaking and can often require significant real investment. This approach has been highly successful in vision and other fields and stands to revolutionize how communications system engineering is done today.

While much of the work throughout my dissertation studies has perhaps raised many more questions than it has answered about the topic, I believe many of the data-centric approaches to radio signal sensing, labeling, communications system synthesis, and design developed herein all hold a high degree of inevitability for the field. Certainly specific optimization techniques and architectures will continue to change and advance over the coming years, but the basic shift towards optimization of high degree of freedom models on real datasets and impairment models seems likely to rapidly become the norm as quantitative performance results become stronger and more widely disseminated and accepted. The list of potential research directions and applications in the field as this transition occurs represents an enormously rich array of possibilities and areas for improvement. Initial results shown herein provide significant evidence of the disruptive potential for improvement in the radio signal processing space, in communications system learning, sensor system learning, and many similar applications considered from a machine learning perspective. Building, integrating and deploying such systems into the real world holds many remaining engineering challenges, but I look forward to rapidly maturing this field and learning from others as the field grows and machine learning based radio physical layers and signal processing techniques improve.

6.1 Publication List

Below is the relevant body of academic published work corresponding to my dissertation research over the past several years.

Journal Articles

• T. O’Shea, J. Hoydis [An Introduction to Deep Learning for the Physical Layer], IEEE Transactions on Cognitive Communications Systems, 2017 (accepted)

• T. O’Shea, T. Roy, T. Clancy [Over the Air Deep Learning Based Radio Signal Classification] IEEE JSTSP 2017 (accepted)

• T. O’Shea, T. Erpek, T. Clancy [Deep Learning-Based MIMO Communications], (under resubmission)

• T. O’Shea, T. Clancy, T. Roy, T. Erpek, K. Karra [Deep Learning and Data Centric Approaches to Wireless Signal Processing Systems], (under resubmission)

• C. Clancy, J. Hecker, E. Stuntebeck, T. O’Shea [Applications of machine learning to cognitive radio networks] Wireless Communications, IEEE 14 (4), 47-52

Peer Reviewed Conference Papers

• T. O’Shea, T. Roy, T. Clancy, [Learning Robust General Radio Signal Detection using Computer Vision Methods], Asilomar SSC 2017 (to appear)

• T. O’Shea, T. Erpek, T. Clancy, [Physical Layer Deep Learning of Encodings for the MIMO Fading Channel], Allerton Conference on Communications, Control, and Computing 2017

• T. O’Shea, K. Karra, T. Clancy, [Learning Approximate Neural Estimators for Wireless Channel State Information], IEEE MLSP 2017

• T. O’Shea, T. Roy, T. Erpek, [Spectral Detection and Localization of Radio Events with Learned Convolutional Neural Features], IEEE EUSIPCO 2017

• T. O’Shea, N. West, M. Vondal, T. Clancy [Semi-Supervised Radio Signal Identification], IEEE International Conference on Advanced Communications Technology, 2017 (outstanding paper award)

• N. West, T. O’Shea [Deep Architectures for Modulation Recognition], IEEE DySpan, 2017

• T. O’Shea, S. Hitefield, J. Corgan [End-to-end Traffic Sequence Recognition with Recurrent Neural Networks], IEEE GlobalSip, 2016

• T. O’Shea, K. Karra, T. Clancy, [Learning to Communicate: Channel auto-encoders, Domain Specific Regularizers, and Attention], IEEE International Symposium on Signal Processing and Information Technology 2016

• T. O’Shea, L. Pemula, D. Batra, T. Clancy, [Radio Transformer Networks: Attention Models for Learning to Synchronize in Wireless Systems], IEEE Asilomar Conference on Signals, Systems and Computing 2016

• T. O’Shea, N. West, [Radio Machine Learning Dataset Generation with GNU Radio], GNU Radio Conference 2016

• T. O’Shea, J. Corgan, T. Clancy, [Unsupervised Representation Learning of Structured Radio Communication Signals] International Workshop on Sensing, Processing and Learning for Intelligent Machines 2016

• T. O’Shea, J. Corgan, T. Clancy, [Convolutional Radio Modulation Recognition Networks] Engineering Applications of Neural Networks 2016

• D. CaJacob, N. McCarthy, T. O’Shea, R. McGwier, [Geolocation of RF Emitters with a Formation-Flying Cluster of Three Microsatellites] Small Satellite Conference 2016

• T. O’Shea, K. Karra [GNU Radio Signal Processing Models for Dynamic Multi-User Burst Modems] Software Radio Implementation Forum 2015

• S. Hitefield, V. Nguyen, C. Carlson, T. O’Shea, T. Clancy [Demonstrated LLC-layer attack and defense strategies for wireless communication systems] IEEE Conference on Communications and Network Security (CNS) 2014

• C. Carlson, V. Nguyen, S. Hitefield, T. O’Shea, T. Clancy [Measuring smart jammer strategy efficacy over the air] IEEE Conference on Communications and Network Security (CNS) 2014

• T. O’Shea, T. Rondeau, [A universal GNU radio performance benchmarking suite], Karlsruhe Workshop on Software Radio 2014

• T. Rondeau, T. O’Shea, [Designing Analysis and Synthesis Filterbanks in GNU Radio], Karlsruhe Workshop on Software Radios 2014

Pre-Publication Papers

• T. O’Shea, T. Clancy, [Deep Reinforcement Learning Radio Control and Signal Detection with KeRLym, a Gym RL Agent], ArXiv Pre-publication 1605.09221 2016

• T. O’Shea, T. Clancy, R. McGwier [Recurrent Neural Radio Anomaly Detection], ArXiv Pre-Publication 1611.00301 2016

• T. O’Shea, A. Mondl, T. Clancy [A Modest Proposal for Open Market Risk Assessment to Solve the Cyber-Security Problem] ArXiv Pre-Publication 1604.08675 2016

Invited/Non-Paper Talks

• T. O’Shea, [The Future of Radio: Learning Efficient Signal Processing Systems], GNU Radio Conference 2017

• T. O’Shea, [Learning Signal Processing and Communications Systems from Data], IEEE CCAA Workshop Keynote 2017

• T. O’Shea, [Deep Learning on the Radio Physical Layer], JASON 2017 Summer Study

• T. O’Shea, [TensorFlow Applications in Signal Processing], IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (TensorFlow BoF Hosted by Google) 2016

• T. O’Shea, [Radio Data Analytics with Machine Learning], International Symposium on Advanced Radio Technologies (ISART) 2016

• R. McGwier, T. O’Shea, K. Karra, M. Fowler, [Recent Developments in Artificial Intelligence Applications of Deep Learning for Signal Processing], Virginia Tech Wireless Symposium 2016

• T. O’Shea, [Handing Full Control of the Radio Spectrum over to the Machines], DEFCON Wireless Village 2016

• T. O’Shea, [Radio Machine Learning with FOSS, GNU Radio and TensorFlow] FOSDEM 2016

• T. O’Shea, [Rapid GNU Radio GPU Algorithm Prototyping from Python (gr-theano)], FOSDEM 2015

• T. O’Shea, [GNU Radio Tools for Radio Wrangling and Spectrum Domination], DEFCON 23 Wireless Village 2015

• T. O’Shea, [Tutorial: Exploring Data], GNU Radio Conference 2015

Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[2] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.

[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[5] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.


[6] C. Moore, “Data processing in exascale-class computer systems,” in The Salishan Conference on High Speed Computing, 2011.

[7] J.-H. Huang, “Keynote and volta series product announcement,” in GPU Technology Conference, 2017.

[8] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” CoRR, vol. abs/1703.01041, 2017. [Online]. Available: http://arxiv.org/abs/1703.01041

[9] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[10] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.

[11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” See https://arxiv.org/abs/1610.02391v3, 2016.

[12] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” CoRR, vol. abs/1703.00810, 2017. [Online]. Available: http://arxiv.org/abs/1703.00810

[13] “Lte phy lab, e-utra phy golden reference model,” https://www.is-wireless.com/5g-toolset-old/lte-phy-lab-old/, (Accessed on 10/01/2017).

[14] F. Chollet, “Building autoencoders in keras,” https://blog.keras.io/building-autoencoders-in-keras.html, (Accessed on 10/01/2017).

[15] O. A. Dobre, A. Abdi, Y. Bar-Ness, and W. Su, “Survey of automatic modulation classification techniques: classical approaches and new trends,” IET communications, vol. 1, no. 2, pp. 137–156, 2007.

[16] C. Olah, “Understanding lstm networks,” Online Article http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015.

[17] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.

[18] T. J. O’Shea, T. Roy, and T. C. Clancy, “Learning robust general radio signal detection using computer vision methods,” in 2017 51st Asilomar Conference on Signals, Systems and Computers, Nov 2017.

[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[20] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, Jul. 1948.

[21] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near shannon limit error-correcting coding and decoding: Turbo-codes. 1,” in Communications, 1993. ICC’93 Geneva. Technical Program, Conference Record, IEEE International Conference on, vol. 2. IEEE, 1993, pp. 1064–1070.

[22] R. M. Pyndiah, “Near-optimum decoding of product codes: Block turbo codes,” IEEE Transactions on communications, vol. 46, no. 8, pp. 1003–1010, 1998.

[23] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information theory, vol. 8, no. 1, pp. 21–28, 1962.

[24] R. v. Nee and R. Prasad, OFDM for wireless multimedia communications. Artech House, Inc., 2000.

[25] P. Patel and J. Holtzman, “Analysis of a simple successive interference cancellation scheme in a ds/cdma system,” IEEE journal on selected areas in communications, vol. 12, no. 5, pp. 796–807, 1994.

[26] H. Wymeersch, Iterative receiver design. Cambridge University Press Cambridge, 2007, vol. 234.

[27] D. J. Jakubisin, R. M. Buehrer, and C. R. da Silva, “Bp, mf, and ep for joint channel estimation and detection of mimo-ofdm signals,” in Global Communications Conference (GLOBECOM), 2016 IEEE. IEEE, 2016, pp. 1–6.

[28] M. J. Demongeot, M. J. Mazoyer, M. P. Peretto, and M. D. Whitley, “Neural network synthesis using cellular encoding and the genetic algorithm.” 1994.

[29] J. Branke, “Evolutionary algorithms for neural network design and training,” in Proceedings of the First Nordic Workshop on Genetic Algorithms and its Applications. Citeseer, 1995.

[30] J. Bruck and M. Blaum, “Neural networks, error-correcting codes, and polynomials over the binary n-cube,” IEEE Transactions on information theory, vol. 35, no. 5, pp. 976–987, 1989.

[31] F. Jondral, “Automatic classification of high frequency signals,” Signal Processing, vol. 9, no. 3, pp. 177–190, 1985.

[32] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4, pp. 18–28, 1998.

[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[34] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.

[35] J. Sánchez and F. Perronnin, “High-dimensional signature compression for large-scale image classification,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1665–1672.

[36] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform cldnns,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[37] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.

[38] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[39] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” 2016.

[40] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[41] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.

[42] W. Namgoong and T. H. Meng, “Direct-conversion rf receiver design,” IEEE Transactions on Communications, vol. 49, no. 3, pp. 518–529, 2001.

[43] R. AD9361, “Agile transceiver, data sheet, analog devices,” Inc, vol. 2, p. 014, 2013.

[44] H. Nyquist, “Certain topics in telegraph transmission theory,” Transactions of the American Institute of Electrical Engineers, vol. 47, no. 2, pp. 617–644, 1928.

[45] M. Ettus, “Universal software radio peripheral,” 2009.

[46] T. O’Shea, “Gnu radio channel simulation,” in GNU Radio Conference 2013, 2013.

[47] D. Middleton, An introduction to statistical communication theory. IEEE press Piscataway, NJ, 1996.

[48] J. Mitola and G. Q. Maguire, “Cognitive radio: making software radios more personal,” IEEE personal communications, vol. 6, no. 4, pp. 13–18, 1999.

[49] T. W. Rondeau, “Application of artificial intelligence to wireless communications,” Ph.D. dissertation, Virginia Polytechnic Institute and State University, 2007.

[50] P. J. Kolodzy, “Dynamic spectrum policies: promises and challenges,” CommLaw Conspectus, vol. 12, p. 147, 2004.

[51] T. C. Clancy, “Dynamic spectrum access in cognitive radio networks,” Ph.D. dissertation, 2006.

[52] W. Gardner, W. Brown, and C.-K. Chen, “Spectral correlation of modulated signals: Part ii–digital modulation,” IEEE Transactions on Communications, vol. 35, no. 6, pp. 595–601, 1987.

[53] S. Geirhofer, L. Tong, and B. M. Sadler, “Cognitive radios for dynamic spectrum access-dynamic spectrum access in the time domain: Modeling and exploiting white space,” IEEE Communications Magazine, vol. 45, no. 5, 2007.

[54] Z. Ji and K. R. Liu, “Cognitive radios for dynamic spectrum access-dynamic spectrum sharing: A game theoretical overview,” IEEE Communications Magazine, vol. 45, no. 5, 2007.

[55] A. Amanna and J. H. Reed, “Survey of cognitive radio architectures,” in IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the. IEEE, 2010, pp. 292–297.

[56] E. Stuntebeck, T. O’Shea, J. Hecker, and T. Clancy, “Architecture for an open-source cognitive radio,” in Proceedings of the SDR forum technical conference, 2006.

[57] P. J. Werbos, “Applications of advances in nonlinear sensitivity analysis,” in System modeling and optimization. Springer, 1982, pp. 762–770.

[58] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.

[59] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147.

[60] A. Nemirovskii, D. B. Yudin, and E. R. Dawson, “Problem complexity and method efficiency in optimization,” 1983.

[61] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.

[62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[63] J. Zhang, I. Mitliagkas, and C. Ré, “Yellowfin and the art of momentum tuning,” arXiv preprint arXiv:1706.03471, 2017.

[64] C. Xu, T. Qin, G. Wang, and T.-Y. Liu, “Reinforcement learning for learning rate control,” arXiv preprint arXiv:1705.11159, 2017.

[65] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, “Random synaptic feedback weights support error backpropagation for deep learning,” Nature communications, vol. 7, 2016.

[66] B. Scellier and Y. Bengio, “Equilibrium propagation: Bridging the gap between energy-based models and backpropagation,” Frontiers in computational neuroscience, vol. 11, 2017.

[67] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.

[68] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.

[69] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.

[70] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013.

[71] J. S. Bridle, “Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters,” in Advances in neural information processing systems, 1990, pp. 211–217.

[72] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[73] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” arXiv preprint arXiv:1706.02515, 2017.

[74] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incorporating second-order functional knowledge for better option pricing,” in Advances in neural information processing systems, 2001, pp. 472–478.

[75] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.

[76] A. B. Geva, “Scalenet-multiscale neural-network architecture for time series prediction,” IEEE Transactions on neural networks, vol. 9, no. 6, pp. 1471–1482, 1998.

[77] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language modeling,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.

[78] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” 1999.

[79] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th international conference on machine learning (ICML-13), 2013, pp. 1058–1066.

[80] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.

[81] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.

[82] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, 1965.

[83] M. Gschwind, “Chip multiprocessing and the cell broadband engine,” in Proceedings of the 3rd conference on Computing frontiers. ACM, 2006, pp. 1–8.

[84] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” IEEE micro, vol. 27, no. 5, pp. 15–31, 2007.

[85] N. McCarthy, E. Blossom, N. Goergen, T. O’Shea, and C. Clancy, “High-performance sdr: Gnu radio and the ibm cell broadband engine,” in Virginia Tech Wireless Personal Communications Symposium, 2008.

[86] C. Nvidia, “Compute unified device architecture programming guide,” 2007.

[87] J. Hensley, “Close to the metal,” SIGGRAPH’07, 2007.

[88] A. Munshi, “Opencl: Parallel computing on the gpu and cpu,” SIGGRAPH, Tutorial, pp. 11–15, 2008.

[89] G. Harrison, A. Sloan, W. Myrick, J. Hecker, and D. Eastin, “Polyphase channelization utilizing general-purpose computing on a gpu,” in SDR 2008 technical conference and product exposition, 2008.

[90] G. F. Zaki, W. Plishker, T. O’Shea, N. McCarthy, C. Clancy, E. Blossom, and S. S. Bhattacharyya, “Integration of dataflow optimization techniques into a software radio design framework,” in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on. IEEE, 2009, pp. 243–247.

[91] M. Piscopo, “Study on implementing opencl in common gnuradio blocks,”

Proceedings of the GNU Radio Conference, vol. 2, no. 1, p. 67, 2017. [Online]. Available:

https://pubs.gnuradio.org/index.php/grcon/article/view/15

[92] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,

D. Warde-Farley, and Y. Bengio, “Theano: A cpu and gpu math compiler in

python,” in Proc. 9th Python in Science Conf, 2010, pp. 1–7.

[93] E. Jones, T. Oliphant, P. Peterson et al., “SciPy: Open source scientific

tools for Python,” 2001–, [Online; accessed <today>]. [Online]. Available:

http://www.scipy.org/

[94] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,

A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on

heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

[95] J. McCarthy, “Recursive functions of symbolic expressions and their computation

by machine, part i,” Communications of the ACM, vol. 3, no. 4, pp. 184–195, 1960.

[96] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.

[97] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,

and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in

Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp.

675–678.

[98] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source

framework for deep learning,” in Proceedings of workshop on machine learning systems

(LearningSys) in the twenty-ninth annual conference on neural information processing sys-

tems (NIPS), vol. 5, 2015.

[99] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment

for machine learning,” in BigLearn, NIPS Workshop, 2011.

[100] A. Paszke, S. Gross, and S. Chintala, “Pytorch,” 2017.

[101] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and

Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heteroge-

neous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.

[102] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri et al., “Lasagne:

First release.” Aug. 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.27878

[103] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter

optimization through reversible learning,” in Proceedings of the 32nd International

Conference on Machine Learning, 2015.

[104] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, “Online learning

rate adaptation with hypergradient descent,” arXiv preprint arXiv:1703.04782, 2017.

[105] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,”

arXiv preprint arXiv:1611.01578, 2016.

[106] T. Desell, “Large scale evolution of convolutional neural networks using

volunteer computing,” CoRR, vol. abs/1703.05422, 2017. [Online]. Available:

http://arxiv.org/abs/1703.05422

[107] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny im-

ages,” 2009.

[108] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, “Visualizing higher-layer features

of a deep network,” University of Montreal, vol. 1341, p. 3, 2009.

[109] F. E. Terman et al., “Radio engineering,” 1937.

[110] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel auto-

encoders, domain specific regularizers, and attention,” in 2016 IEEE International

Symposium on Signal Processing and Information Technology (ISSPIT), Dec 2016, pp.

223–228.

[111] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,”

IEEE Transactions on Cognitive Communications and Networking, vol. PP, no. 99, pp.

1–1, 2017.

[112] Y. Li, R. Xu, and F. Liu, “Whiteout: Gaussian adaptive regularization noise in deep

neural networks,” arXiv preprint arXiv:1612.01490, 2016.

[113] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res.,

vol. 9, no. Nov, pp. 2579–2605, 2008.

[114] F. Liang, C. Shen, and F. Wu, “An iterative bp-cnn architecture for channel decod-

ing,” arXiv preprint arXiv:1707.05697, 2017.

[115] S. Cammerer, T. Gruber, J. Hoydis, and S. t. Brink, “Scaling deep learning-based

decoding of polar codes via partitioning,” arXiv preprint arXiv:1702.06901, 2017.

[116] T. Gruber, S. Cammerer, J. Hoydis, and S. t. Brink, “On deep learning-based channel

decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS),

March 2017, pp. 1–6.

[117] T. J. O’Shea, L. Pemula, D. Batra, and T. C. Clancy, “Radio transformer networks:

Attention models for learning to synchronize in wireless systems,” in 2016 50th

Asilomar Conference on Signals, Systems and Computers, Nov 2016, pp. 662–666.

[118] 3GPP TSG RAN, “Evolved universal terrestrial radio access (E-UTRA); multiplexing

and channel coding,” 3rd Generation Partnership Project (3GPP), TS 36.212, 2009.

[119] ETSI, “Evolved universal terrestrial radio access (E-UTRA); physical channels and

modulation,” ETSI TS 136 211, V9.

[120] D. Gesbert, M. Shafi, D.-s. Shiu, P. J. Smith, and A. Naguib, “From theory to practice:

An overview of mimo space-time coded wireless systems,” IEEE Journal on selected

areas in Communications, vol. 21, no. 3, pp. 281–302, 2003.

[121] E. Luther, “5g massive mimo testbed: From theory to reality,” white paper, available

online: https://studylib.net/doc/18730180/5g-massive-mimo-testbed--from-theory-to-reality, 2014.

[122] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, “Space-time block codes from or-

thogonal designs,” IEEE Transactions on Information theory, vol. 45, no. 5, pp. 1456–

1467, 1999.

[123] S. M. Alamouti, “A simple transmit diversity technique for wireless communica-

tions,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 8, pp. 1451–1458,

Oct 1998.

[124] A. J. Paulraj, R. W. Heath Jr, P. K. Sebastian, and D. J. Gesbert, “Spatial multiplexing

in a cellular network,” May 23 2000, US Patent 6,067,290.

[125] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning-Based

Communication Over the Air,” ArXiv e-prints, Jul. 2017.

[126] T. Söderström and P. Stoica, System identification. Prentice-Hall, Inc., 1988.

[127] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation

through the void: Optimizing control variates for black-box gradient estimation,”

arXiv preprint arXiv:1711.00123, 2017.

[128] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning approximate neural estimators

for wireless channel state information,” in 2016 IEEE International Workshop on Ma-

chine Learning for Signal Processing (MLSP), Sep 2017.

[129] Y. Wang, K. Shi, and E. Serpedin, “Non-data-aided feedforward carrier frequency

offset estimators for qam constellations: A nonlinear least-squares approach,”

EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 13, p. 856139, 2004.

[130] O. Catoni et al., “Challenging the empirical mean and empirical variance: a deviation

study,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48,

no. 4. Institut Henri Poincaré, 2012, pp. 1148–1185.

[131] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition

networks,” in International Conference on Engineering Applications of Neural Networks.

Springer, 2016, pp. 213–226.

[132] K. S. K. Arumugam, I. A. Kadampot, M. Tahmasbi, S. Shah, M. Bloch, and

S. Pokutta, “Modulation recognition using side information and hybrid learning,”

in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium

on. IEEE, 2017, pp. 1–2.

[133] K. Triantafyllakis, M. Surligas, G. Vardakis, and S. Papadakis, “Phasma: An auto-

matic modulation classification system based on random forest,” in Dynamic Spec-

trum Access Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017,

pp. 1–3.

[134] M. Laghate, S. Chaudhari, and D. Cabric, “Usrp n210 demonstration of wideband

sensing and blind hierarchical modulation classification,” in Dynamic Spectrum Ac-

cess Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp.

1–3.

[135] J. L. Ziegler, R. T. Arn, and W. Chambers, “Modulation recognition with gnu ra-

dio, keras, and hackrf,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE

International Symposium on. IEEE, 2017, pp. 1–3.

[136] K. Karra, S. Kuzdeba, and J. Petersen, “Modulation recognition using hierarchical

deep neural networks,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE

International Symposium on. IEEE, 2017, pp. 1–3.

[137] N. E. West, K. Harwell, and B. McCall, “Dft signal detection and channelization

with a deep neural network modulation classifier,” in Dynamic Spectrum Access Net-

works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.

[138] C. M. Spooner, A. N. Mody, J. Chuang, and J. Petersen, “Modulation recognition

using second-and higher-order cyclostationarity,” in Dynamic Spectrum Access Net-

works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.

[139] T. J. O’Shea and N. West, “Radio machine learning dataset generation with gnu

radio,” in Proceedings of the GNU Radio Conference, vol. 1, no. 1, 2016.

[140] C. Weaver, C. Cole, R. Krumland, and M. Miller, “The automatic classification of

modulation types by pattern recognition.” STANFORD UNIV CALIF STANFORD

ELECTRONICS LABS, Tech. Rep., 1969.

[141] J. Aisbett, “Automatic modulation recognition using time domain parameters,” Sig-

nal Processing, vol. 13, no. 3, pp. 323–328, 1987.

[142] W. A. Gardner and C. M. Spooner, “Cyclic spectral analysis for signal detection and

modulation recognition,” in Military Communications Conference, 1988. MILCOM 88,

Conference record. 21st Century Military Communications-What’s Possible? 1988 IEEE.

IEEE, 1988, pp. 419–424.

[143] ——, “Signal interception: performance advantages of cyclic-feature detectors,”

IEEE Transactions on Communications, vol. 40, no. 1, pp. 149–159, 1992.

[144] C. M. Spooner and W. A. Gardner, “Robust feature detection for signal intercep-

tion,” IEEE transactions on communications, vol. 42, no. 5, pp. 2165–2173, 1994.

[145] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, “Automatic modulation classi-

fication based on high order cumulants and hierarchical polynomial classifiers,”

Physical Communication, vol. 21, pp. 10–18, 2016.

[146] A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulation recognition

of communication signals,” IEEE Transactions on communications, vol. 46, no. 4, pp.

431–436, 1998.

[147] A. Fehske, J. Gaeddert, and J. H. Reed, “A new approach to signal classification

using spectral correlation and neural networks,” in New Frontiers in Dynamic Spec-

trum Access Networks, 2005. DySPAN 2005. 2005 First IEEE International Symposium

on. IEEE, 2005, pp. 144–150.

[148] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale

image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[149] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-

del, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in

python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[150] D. George and E. Huerta, “Deep neural networks to enable real-time multimessen-

ger astrophysics,” arXiv preprint arXiv:1701.00008, 2016.

[151] S. Cioni, G. Colavolpe, V. Mignone, A. Modenini, A. Morello, M. Ricciulli,

A. Ugolini, and Y. Zanettini, “Transmission parameters optimization and receiver

architectures for dvb-s2x systems,” International Journal of Satellite Communications

and Networking, vol. 34, no. 3, pp. 337–350, 2016.

[152] M. Ettus and M. Braun, “The universal software radio peripheral (usrp) family of

low-cost sdrs,” Opportunistic Spectrum Sharing and White Space Access: The Practical

Reality, pp. 3–23, 2015.

[153] Analog Devices, “AD9361 RF agile transceiver,” data sheet. [Online]. Available: http://www.analog.com/static/imported-files/data_sheets/ad9361.pdf (visited on 09/14/08).

[154] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings

of the 22nd acm sigkdd international conference on knowledge discovery and data mining.

ACM, 2016, pp. 785–794.

[155] N. E. West and T. O’Shea, “Deep architectures for modulation recognition,” in Dy-

namic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium on.

IEEE, 2017, pp. 1–6.

[156] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing

human-level performance on imagenet classification,” in Proceedings of the IEEE in-

ternational conference on computer vision, 2015, pp. 1026–1034.

[157] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-

houcke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the

IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[158] T. J. O’Shea, S. Hitefield, and J. Corgan, “End-to-end radio traffic sequence recogni-

tion with recurrent neural networks,” in 2016 IEEE Global Conference on Signal and

Information Processing (GlobalSIP), Dec 2016, pp. 277–281.

[159] R.-P. Weinmann, “Baseband attacks: Remote exploitation of memory corruptions in

cellular protocol stacks.” in WOOT, 2012, pp. 12–21.

[160] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm:

A search space odyssey,” IEEE transactions on neural networks and learning systems,

2017.

[161] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation,

vol. 9, no. 8, pp. 1735–1780, 1997.

[162] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recur-

rent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

[163] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,”

arXiv preprint arXiv:1611.01576, 2016.

[164] A. Orebaugh, G. Ramirez, and J. Beale, Wireshark & Ethereal network protocol analyzer

toolkit. Syngress, 2006.

[165] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” Online

Article http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2015.

[166] H. Kim and K. G. Shin, “In-band spectrum sensing in cognitive radio networks:

energy detection or feature detection?” in Proceedings of the 14th ACM international

conference on Mobile computing and networking. ACM, 2008, pp. 14–25.

[167] R. Ewerth, M. Springstein, L. A. Phan-Vogtmann, and J. Schütze, “Are machines

better than humans in image tagging?-a user study adds to the puzzle,” in European

Conference on Information Retrieval. Springer, 2017, pp. 186–198.

[168] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for

identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.

[169] S. Sarraf, G. Tofighi et al., “Deepad: Alzheimer’s disease classification via deep

convolutional neural networks using mri and fmri,” bioRxiv, p. 070441, 2016.

[170] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional net-

works for accurate object detection and segmentation,” IEEE transactions on pattern

analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.

[171] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,

real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, 2016, pp. 779–788.

[172] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, SSD:

Single Shot MultiBox Detector. Cham: Springer International Publishing, 2016, pp.

21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2

[173] E. Blossom, “GNU radio: tools for exploring the radio frequency spectrum,” Linux

journal, vol. 2004, no. 122, p. 4, 2004.

[174] B. Sklar, Digital communications. Prentice Hall NJ, 2001, vol. 2.

[175] T. J. O’Shea, N. West, M. Vondal, and T. C. Clancy, “Semi-supervised radio signal

identification,” in Advanced Communication Technology (ICACT), 2017 19th Interna-

tional Conference on. IEEE, 2017, pp. 33–38.

[176] O. Chapelle and A. Zien, “Semi-supervised classification by low density separation.”

in AISTATS, 2005, pp. 57–64.

[177] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning (chapelle, o. et

al., eds.; 2006)[book reviews],” IEEE Transactions on Neural Networks, vol. 20, no. 3,

pp. 542–542, 2009.

[178] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synthesis

lectures on artificial intelligence and machine learning, vol. 3, no. 1, pp. 1–130, 2009.

[179] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics

and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.

[180] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley

& Sons, 2004, vol. 46.

[181] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,”

in International Conference on Artificial Neural Networks. Springer, 1997, pp. 583–588.

[182] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with

compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.

[183] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,

“Full resolution image compression with recurrent neural networks,” arXiv preprint

arXiv:1608.05148, 2016.

[184] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures

for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017. [Online].

Available: http://arxiv.org/abs/1707.07012

[185] T. Bäck and H.-P. Schwefel, “An overview of evolutionary algorithms for parameter

optimization,” Evolutionary computation, vol. 1, no. 1, pp. 1–23, 1993.

[186] G. Venter and J. Sobieszczanski-Sobieski, “Particle swarm optimization,” AIAA

journal, vol. 41, no. 8, pp. 1583–1589, 2003.

[187] A. Goldbloom, “Data prediction competitions–far more than just a bit of fun,” in

Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010,

pp. 1385–1386.