THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

NONLINEAR NEURON DISCRIMINANT FUNCTIONS FOR ALTERNATE DEEP LEARNING TRAINING ALGORITHMS

STEVEN PETRONE SPRING 2020

A thesis submitted in partial fulfillment of the requirements for baccalaureate degrees in Computer Engineering with honors in Computer Engineering

Reviewed and approved* by the following:

Christopher Griffin
Associate Research Professor, Applied Research Laboratory & Department of Mathematics
Thesis Supervisor

Vijay Narayanan
Professor of Computer Science and Engineering
Honors Adviser

*Signatures are on file in the Schreyer Honors College.

Abstract

In this work we present a novel neuron discriminant function that allows for alternate training algorithms for deep learning. The new neuron type, which we call a posynomial neuron, can be combined with linear neurons to represent functions that are exponentials when inferencing new data, but are only polynomials of the network weights. We show that these networks can be resistant to the vanishing gradient problem. We also formulate training the network as a geometric programming problem and discuss the interesting benefits this can have over training a network with gradient descent, such as data set analysis and network interpretability. We provide a C++ library that implements both posynomial and sigmoidal networks but provides flexibility for additional novel layer types. We also provide a tensor library that has applications beyond deep learning.

Table of Contents

List of Figures

List of Tables

Acknowledgements

1 Introduction
  1.1 Deep Learning
  1.2 Contribution
  1.3 Organization

2 Literature Review
  2.1 Neural Networks
    2.1.1 Artificial Neurons from Biology
    2.1.2 Logistic Regression and Nonlinearities
    2.1.3 Hidden Layers and XOR problem
    2.1.4 Derivative Backpropagation as an Engine for Gradient Descent
    2.1.5 Survey of Advancements in Neural Networks
  2.2 Posynomials
    2.2.1 Monomials, Polynomials, Posynomials
    2.2.2 Geometric Programming

3 Theory
  3.1 Posynomial Neural Network
    3.1.1 Monomial Based Discriminant Function
    3.1.2 Transforming Input Data Before a Posynomial Network
    3.1.3 Derivatives for gradient descent
  3.2 Geometric and Convex Programming
    3.2.1 Formulation

4 Implementation
  4.1 Data Structures
    4.1.1 Tree
    4.1.2 List
    4.1.3 Tensor

    4.1.4 Layer
    4.1.5 Network
  4.2 Algorithms
    4.2.1 Inference
    4.2.2 Train/Learn
  4.3 Verification and Bug Finding
    4.3.1 Eradicating Memory Leaks with Valgrind
    4.3.2 Input Fuzzing with American Fuzzy Lop

5 Experimental Results
  5.1 XOR With 3 Layer Network
  5.2 XOR with 2 Layer Network
  5.3 Training XOR using an objective function and a nonlinear solver
  5.4 Run time Analysis
    5.4.1 Run time of Dot Product

6 Conclusion
  6.1 Discussion
  6.2 Future Directions
    6.2.1 Alternative to Gradient Descent
    6.2.2 Develop theory and training algorithm for complex value neural networks
    6.2.3 Layer analogous to the convolutional layer
    6.2.4 Library improvements

Bibliography

List of Figures

2.1 Diagram of single neuron

3.1 Rotation of XOR data points
3.2 Posynomial learnable saddle point
3.3 Posynomial learnable saddle point contour plot
3.4 Saddle point that computes XOR
3.5 Saddle point contour plot that computes XOR

4.1 UML class diagram
4.2 Valgrind verifying absence of memory leaks
4.3 AFL Fuzzing to identify bugs

5.1 Diagram of 3 layer XOR network
5.2 Initial weights of 3 layer XOR network
5.3 3 layer XOR network showing correct outputs
5.4 Near perfect outputs for 3 layer XOR network
5.5 Transformation of inputs by first layer in 3 layer XOR network
5.6 Posynomial layers of 3 layer XOR network
5.7 Contour plot of posynomial layers of 3 layer XOR network
5.8 Plot of 3 layer XOR network
5.9 Contour plot of 3 layer XOR network
5.10 Diagram of 2 layer XOR network
5.11 Initial weights of 2 layer XOR network
5.12 Correct outputs for 2 layer XOR network
5.13 Perfect outputs for 2 layer XOR network
5.14 Plot of 2 layer XOR network
5.15 Contour plot of 2 layer XOR network
5.16 Posynomial XOR trained with nonlinear solver
5.17 Contour plot of posynomial XOR trained with nonlinear solver
5.18 Run time of dot product implementations for tensor library

List of Tables

2.1 Definition of XOR

Acknowledgements

I would like to thank Christopher Griffin for his guidance with this work.

Chapter 1

Introduction

1.1 Deep Learning

Deep learning is a common cross disciplinary research topic. While commonly thought of as a subfield of computer science and engineering, deep learning and the study of neural networks were inspired by attempts to study the mammalian brain. Now that it is a well established and popular field, deep learning has found applications in a surprisingly diverse range of disciplines. Deep learning attempts to solve problems such as classification, regression, and even content generation by training a network of artificial neurons using large data sets. Individual neurons 'fire' based on a computation involving their inputs, the weighted connections, and an internal bias. It is the process of tuning these weights that motivates the term 'learning', and the multilayered structure of the neurons that motivates the term 'deep'.

1.2 Contribution

The main contribution of this work is introducing a novel neural network layer type that is multiplicative in nature rather than linear. We show that this layer is a log transform, different from a linear layer with no activation function. When combined with a linear layer, a two layer network can resemble functions that compute multiplicative exponentials of the input when inferencing. However, when the output is considered a function of the network weights, it is a simple polynomial objective function, which allows for many novel optimization algorithms in addition to traditional gradient descent. We introduce some of these algorithms and their possible benefits to training speed, network interpretability, and security. We also discuss interesting properties of posynomial neural networks such as their resistance to the vanishing gradient problem. To train these networks we created a lightweight C++ library. The library includes many specialized data structures, including a generalized Tensor library that has applications beyond deep learning. Also included are traditional neural network data structures such as layers and networks and the necessary algorithms to train them.

1.3 Organization

In Chapter 2 we will review the literature to provide an overview of the history and development of deep learning and using neural networks to learn from data, as well as a survey of more recent advancements in the field. We will also discuss posynomials and signomials as classes of functions, and Geometric Programming, a specialized optimization technique for these classes. In Chapter 3 we will introduce the theory of our contribution by describing our novel neuron types and the required calculus to train them with gradient descent. Additionally, we will formulate an objective function to train a neural network with Geometric Programming. In Chapter 4 we will discuss the implementation details of the libraries we created as part of this work and the steps we took to validate their functionality, stability, and security. Chapter 5 provides our experimental results, where we discuss training multiple posynomial networks on the classic XOR problem and provide graphical representations of the network that show an intuitive way of understanding how posynomial networks work. We conclude our work in Chapter 6, where we provide a discussion along with future directions.

Chapter 2

Literature Review

2.1 Neural Networks

Deep learning has had several periods of prolific research since its inception, with the latest resurgence occurring in recent years. With recent advancements and discoveries, deep learning is likely to forever change academic, industrial, government, and consumer technologies [1]. Deep learning is being implemented in IoT devices [2, 3] for computer vision, speech recognition, and language processing, from embedded devices to ultra large scale data centers [4, 5] for training models from terabytes or more of data. Countless fields have discovered applications for deep learning including medical image processing [6], time series forecasting [7], autonomous vehicle driving [8], art generation [9], drug discovery [10, 11], language processing [12] and translation [13, 14]. Neural networks and deep learning have even been used to beat Chess and Go champions.

Deep learning involves training a network of artificial neurons with large data sets in an attempt to mimic how biological organisms learn from experience. In fact, much of the early research that led to the field of deep learning was an attempt to understand the brain.

2.1.1 Artificial Neurons from Biology

Many ideas that led to the creation of artificial neural networks came from attempts to understand the brain. The 1943 paper by Walter Pitts and Warren McCulloch [15] was one of the first attempts at describing a brain's mental activity in terms of neural networks that can be quantified with propositional logic. A neural network can be thought of as a mathematical graph where vertices emulate neurons. The neurons can be considered binary, having two states, or alternatively able to take on a range of values. The weighted edges of the graph emulate dendrites, where a larger weight indicates that an incoming signal is more likely to trigger the neuron to fire. Originally, Pitts and McCulloch described two types of neurons: peripheral afferents, which were considered inputs, for example a signal over the optic nerve [16]. Today, these are simply called input neurons. The other class of neurons were the outputs, each computed as a function of the input signals from the peripheral afferents and the weights.

2.1.2 Logistic Regression and Nonlinearities

Individual neurons are the elementary building blocks of neural networks and can themselves be considered their own small machine learning algorithm, the logistic regression, when used with sigmoidal activations. Logistic regression was first described by Cox in [17]. The neuron is given a vector of inputs, $\vec{x}$. The properties of the neuron are its weight vector $\vec{w}$ and bias $b$. The neuron's output is computed in two steps. First, the discriminant function $y$ is computed, which is the dot product of the weights with the inputs, plus the additive bias.

$$ y = \vec{w} \cdot \vec{x} + b \qquad (2.1) $$

Some sources refer to this as the logit [18]; however, we found this confusing given the alternate definitions of the logit. At this point the neuron has only computed something linear. To discriminate nonlinear data, a nonlinear activation $\sigma(y)$ is applied. The original activation

is the sigmoid.

$$ z = \sigma(y) = \frac{1}{1 + e^{-y}} \qquad (2.2) $$

where $z$ is the output of the neuron. Other activation functions such as $\tanh(y)$ have been used, as has the ReLU shown in Equation 2.3.

$$ z = \begin{cases} y & y \ge 0 \\ 0 & y < 0 \end{cases} \qquad (2.3) $$

Alternatively, the leaky ReLU has a nonzero slope of α in the case that y < 0.

$$ z = \begin{cases} y & y \ge 0 \\ \alpha y & y < 0 \end{cases} \qquad (2.4) $$

Figure 2.1: A neuron takes inputs $\vec{x}$ and uses weights $\vec{w}$ and bias $b$ to compute the discriminant, then applies a nonlinear activation function to compute the output.
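As a concrete illustration (not part of the thesis library), the three activations above are one-liners in C++:

#include <cmath>
#include <cstdio>

// Sigmoid activation (Equation 2.2): squashes any real input into (0, 1).
double sigmoid(double y) { return 1.0 / (1.0 + std::exp(-y)); }

// ReLU activation (Equation 2.3): passes positive inputs, zeroes negatives.
double relu(double y) { return y >= 0.0 ? y : 0.0; }

// Leaky ReLU activation (Equation 2.4): negative inputs are scaled by alpha.
double leaky_relu(double y, double alpha) { return y >= 0.0 ? y : alpha * y; }

int main() {
    double y = -2.0;
    std::printf("sigmoid(%.1f)    = %.4f\n", y, sigmoid(y));
    std::printf("relu(%.1f)       = %.4f\n", y, relu(y));
    std::printf("leaky_relu(%.1f) = %.4f\n", y, leaky_relu(y, 0.01));
    return 0;
}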

2.1.3 Hidden Layers and XOR problem

The logistic regression can be used to represent many digital logic gates such as logical AND, OR, NAND, and NOT. However, a single logistic regression is unable to learn relations such as the equivalence relation, or the inverse equivalence relation, XOR. A single line cannot discriminate between output '1' and output '0'.

x1   x2   x1 ⊕ x2
 0    0      0
 0    1      1
 1    0      1
 1    1      0

Table 2.1: Definition of XOR.

The solution to learning XOR is to use a network of neurons by including a hidden layer of neurons. By having the input neurons pass to a hidden layer with multiple neurons, which computes an intermediate function before passing to the output layer, complicated data sets can be properly classified. Cybenko showed in 1989 [19] that a feedforward neural network with at least one hidden layer that uses a sigmoidal activation function can approximate any continuous function on a closed and bounded domain, $\mathbb{R}^n \to \mathbb{R}^m$, to within a nonzero amount of error. The amount of error is dependent on the number of hidden units [20].

2.1.4 Derivative Backpropagation as an Engine for Gradient Descent

It is very difficult to choose the weights of networks larger than a few neurons to correctly model the target function. Rumelhart et al. [21] showed how to apply the chain rule to iteratively train a network to minimize the error using sample data. Training networks with deep hidden layers motivated the use of the term deep learning. Using randomly initialized weights, each weight can be adjusted proportional to its contribution to the error of the network, $\frac{\partial E}{\partial w}$. To do this we begin by calculating the error of the network according to the sum of squared differences between the network's output $z_i$ and the expected output $\hat{z}_i$ for each output neuron and for each data point.

$$ E = \frac{1}{2} \sum_{d \in \text{dataset}} \sum_{i \in \text{outputs}} \left( z_i^{(d)} - \hat{z}_i^{(d)} \right)^2 \qquad (2.5) $$

We can use the chain rule to write $\frac{\partial E}{\partial w}$ in terms of derivatives that can easily be taken from Equations 2.1, 2.2, and 2.5.

$$ \frac{\partial E}{\partial w_{j,i}} = \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial w_{j,i}} = \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial y_i} \frac{\partial y_i}{\partial w_{j,i}} $$

Applying the derivatives of the error, activation function, and discriminant gives the contribution of each individual weight to the network error.

$$ \frac{\partial E}{\partial z_i} = \sum_{d \in \text{dataset}} \left( z_i^{(d)} - \hat{z}_i^{(d)} \right) \qquad (2.6) $$

$$ \frac{\partial z_i}{\partial y_i} = \frac{e^{-y_i}}{(1 + e^{-y_i})^2} = \frac{1}{1 + e^{-y_i}} \cdot \frac{e^{-y_i}}{1 + e^{-y_i}} = z_i (1 - z_i) \qquad (2.7) $$

$$ \frac{\partial y_i}{\partial w_{j,i}} = x_j \qquad (2.8) $$

For a hidden layer, finding $\frac{\partial E}{\partial z_j}$ is the same as finding $\frac{\partial E}{\partial x_j}$ for the outer layer.

$$ \frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial x_j} = \sum_{i \in \text{outer layer}} \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial x_j} $$

$$ = \sum_{i \in \text{outer layer}} \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial y_i} \frac{\partial y_i}{\partial x_j} $$

Equations 2.6 and 2.7 can be used to backpropagate error to previous layers along with Equation 2.9.

$$ \frac{\partial y_i}{\partial x_j} = w_{j,i} \qquad (2.9) $$

The weights of the hidden layer can then be updated using $\frac{\partial E}{\partial z_j}$ the same way $\frac{\partial E}{\partial z_i}$ was used with Equations 2.7 and 2.8. Errors for deeper layers can be found by recursively applying the method used for the first hidden layer. Finally, once a weight's contribution to the network error is found, the weight can be adjusted by subtracting an amount proportional to the derivative of the error.

$$ w' = w - \lambda \cdot \frac{\partial E}{\partial w} \qquad (2.10) $$

where $\lambda$ is the learning rate of the network. Fortunately, most of the derivatives, such as Equations 2.6, 2.7, 2.8, and 2.9, reuse already computed values, making training a network computationally feasible. This process is known as gradient descent with a fixed step length.
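To make the chain of Equations 2.6 through 2.10 concrete, the following self-contained C++ sketch (an illustration, not the library described in Chapter 4) performs fixed-step gradient descent on a single sigmoid neuron; the AND data set, learning rate, and epoch count are made up for the example.

#include <cmath>
#include <cstdio>
#include <vector>

// One sigmoid neuron: y = w.x + b (Eq. 2.1), z = sigmoid(y) (Eq. 2.2).
struct Neuron {
    std::vector<double> w;
    double b;
};

double sigmoid(double y) { return 1.0 / (1.0 + std::exp(-y)); }

// One fixed-step gradient descent update over the whole data set,
// chaining Equations 2.6 (dE/dz), 2.7 (dz/dy), 2.8 (dy/dw), and 2.10.
void gradient_step(Neuron& n, const std::vector<std::vector<double>>& X,
                   const std::vector<double>& target, double lambda) {
    std::vector<double> grad_w(n.w.size(), 0.0);
    double grad_b = 0.0;
    for (size_t d = 0; d < X.size(); ++d) {
        double y = n.b;
        for (size_t j = 0; j < n.w.size(); ++j) y += n.w[j] * X[d][j];
        double z = sigmoid(y);
        double delta = (z - target[d]) * z * (1.0 - z);   // dE/dz * dz/dy
        for (size_t j = 0; j < n.w.size(); ++j) grad_w[j] += delta * X[d][j];
        grad_b += delta;                                   // dy/db = 1
    }
    for (size_t j = 0; j < n.w.size(); ++j) n.w[j] -= lambda * grad_w[j];
    n.b -= lambda * grad_b;                                // Eq. 2.10
}

int main() {
    // Learn logical AND, which a single neuron can represent.
    std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
    std::vector<double> t = {0,0,0,1};
    Neuron n{{0.1, -0.1}, 0.0};
    for (int epoch = 0; epoch < 20000; ++epoch) gradient_step(n, X, t, 0.5);
    for (size_t d = 0; d < X.size(); ++d) {
        double y = n.b + n.w[0]*X[d][0] + n.w[1]*X[d][1];
        std::printf("%g AND %g -> %.3f\n", X[d][0], X[d][1], sigmoid(y));
    }
    return 0;
}

The stochastic variants discussed below differ only in how the data points entering this update are sampled.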

Unfortunately, when networks have too many hidden layers, the error derivatives in the deepest layers become negligible: their magnitudes are hard to represent with the precision of modern computers and can be orders of magnitude smaller than the derivatives in the outer layers. This phenomenon makes training deep networks challenging and is called the vanishing gradient problem. Consider the flattening tails of the sigmoid and tanh functions. As $y_i$ approaches $\pm\infty$, the derivative $\frac{\partial z_i}{\partial y_i}$ approaches 0. When substituted into the chain rule, the errors for previous layers are pushed toward zero, making it hard to represent deeper layers' error gradients on orders of magnitude similar to the outer layers.

2.1.5 Survey of Advancements in Neural Networks

Search Improvements

The training algorithm we presented was a very basic approach that trains over entire datasets and adjusts the weights proportional to a fixed learning rate. In this case the fixed learning rate must be chosen by the implementer and is therefore a hyperparameter whose chosen value is critical for the success of the algorithm. Stochastic gradient descent instead iteratively chooses small batches of the training dataset and computes gradients upon the batch, then repeats the sampling process at random. With stochastic gradient descent the learning rate is also gradually decreased as the minimizing point is approached to account for the increased error in calculating the gradient from

a batch. The benefit is that gradients can be computed and applied much faster than if the entire dataset is considered when calculating the error. With faster and thus more frequent weight updates, convergence is faster, and the cost of each update does not grow with increasingly large data sets. We refer the reader to [22] for more discussion on stochastic search methods for network learning.

While stochastic gradient descent affects the direction of the gradient, other methods affect the size of the step when updating the weights. Momentum [23] computes each step as a combination of the current derivative and an exponentially decaying contribution of the previous derivatives. While a normal weight update resembles $w' = w - \lambda \frac{\partial E}{\partial w}$, weight updates with a momentum approach follow Equations 2.11 and 2.12, with an additional hyperparameter $\mu$ representing the decay factor of the historical gradients that ranges from 0 to 1.

$$ m' = \mu m + \lambda \frac{\partial E}{\partial w} \qquad (2.11) $$

$$ w' = w - m' \qquad (2.12) $$

The weight updates from using momentum can be smoother and more consistent, as momentum reduces the variance introduced by stochastic gradient descent. An extended version of momentum is presented in [24] that incorporates Nesterov's accelerated gradient method. There are also algorithms that incorporate adaptive learning rates. AdaGrad [25] individually adapts learning rates for each network parameter using historical values from previous iterations. ADADELTA [26] is a similar algorithm that requires no tuning of learning rates and is robust to noisy gradients and selection of hyperparameters. RMSProp [27] modifies AdaGrad to use an exponentially weighted moving average. Adam [28] combines momentum with RMSProp, and in [29] Adam is modified to use Nesterov momentum. There are also second order methods that attempt to use information about the second derivative of the error function to provide more insight when choosing an intelligent step size. Consult [30] for a more comprehensive review of line search algorithms and second order optimization methods.
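A minimal sketch of the momentum update of Equations 2.11 and 2.12, applied to a toy one-dimensional objective that is our own invention, not anything from the thesis:

#include <cstdio>

// Gradient of a toy objective E(w) = (w - 3)^2, standing in for dE/dw.
double dEdw(double w) { return 2.0 * (w - 3.0); }

int main() {
    double w = 0.0;              // weight
    double m = 0.0;              // momentum accumulator
    const double lambda = 0.1;   // learning rate
    const double mu = 0.9;       // momentum decay factor, between 0 and 1
    for (int step = 0; step < 500; ++step) {
        m = mu * m + lambda * dEdw(w);   // Equation 2.11
        w = w - m;                       // Equation 2.12
    }
    std::printf("w converged to %.4f (minimum at 3.0)\n", w);
    return 0;
}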

Convolutional Neural Networks

Convolutional neural networks [31] took inspiration from the visual cortex of the mammalian brain [32]. Instead of a simple matrix-vector multiplication of the input vector with weights, a convolutional layer performs a convolution with the weights acting as a linear filter, or kernel. The standard definition of convolution is given in Equation 2.13; however, neural networks have a countable number of input neurons, so the discrete case is used, shown in Equation 2.14. Additionally, convolutional layers are commonly associated with picture or video data where each neuron is mapped to a pixel, so the two dimensional discrete convolution operator is given in Equation 2.15.

$$ y(t) = (x * w)(t) = \int_{-\infty}^{\infty} w(\tau) x(t - \tau) \, d\tau \qquad (2.13) $$

$$ y(t) = (x * w)(t) = \sum_{i=-\infty}^{\infty} w(i) x(t - i) \qquad (2.14) $$

$$ y(s, t) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} w(i, j) x(s - i, t - j) \qquad (2.15) $$

Convolution can be thought of as sliding a filter across a signal and computing a dot product of the filter and the signal as the value of the output at each offset of the filter. In convolutional networks on pictures, this can be a 3x3 or 5x5 filter over an image. Convolutional networks are very successful on image data because they can account for locality of features in an image. For example, if a learned kernel resembles something similar to the horizontal Prewitt operator:

$$ \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} $$

then the feature map, or output of the convolution, will have high responses where the image intensity changes in the horizontal direction. Other learned kernels will learn other features. For color images with three channels, the kernel will be passed over all channels to produce a 3-to-1 channel mapping. Convolutional layers can also feature multiple kernels to increase the number of output channels and leverage learning multiple types of features in an image. The output layer is usually a standard feedforward neural network that learns the importance of each feature. In practice, convolutional networks are highly successful because of their ability to account for the locality of image pixels and because of the small number of required weights. Regardless of the resolution of the input images, a 3x3 convolutional kernel will have 9 weights plus any required biases. This is closely related to the concept of weight sharing, where multiple neurons in a layer depend on the same set of weights, such as a convolutional kernel.
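A sketch of the "valid" two-dimensional sliding dot product (the flipped-kernel form of Equation 2.15 that most frameworks implement); the toy image and kernel values below are illustrative only.

#include <cstdio>
#include <vector>

// "Valid" 2D sliding dot product of a kernel over an image; this is
// Equation 2.15 with the kernel flipped, and the learned weights absorb
// the flip in practice.
std::vector<std::vector<double>> conv2d(
        const std::vector<std::vector<double>>& img,
        const std::vector<std::vector<double>>& k) {
    size_t H = img.size(), W = img[0].size();
    size_t kh = k.size(), kw = k[0].size();
    std::vector<std::vector<double>> out(H - kh + 1,
                                         std::vector<double>(W - kw + 1, 0.0));
    for (size_t s = 0; s + kh <= H; ++s)
        for (size_t t = 0; t + kw <= W; ++t)
            for (size_t i = 0; i < kh; ++i)
                for (size_t j = 0; j < kw; ++j)
                    out[s][t] += k[i][j] * img[s + i][t + j];
    return out;
}

int main() {
    // A toy 5x5 image whose intensity steps from 0 to 1 between columns.
    std::vector<std::vector<double>> img = {
        {0,0,0,1,1}, {0,0,0,1,1}, {0,0,0,1,1}, {0,0,0,1,1}, {0,0,0,1,1}};
    // The horizontal Prewitt operator from the text.
    std::vector<std::vector<double>> prewitt = {{-1,0,1},{-1,0,1},{-1,0,1}};
    auto fmap = conv2d(img, prewitt);
    for (const auto& row : fmap) {
        for (double v : row) std::printf("%5.1f ", v);
        std::printf("\n");
    }
    return 0;
}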

Selected Topics

There are many other research topics related to neural networks and deep learning. Autoencoders involve training a neural network to reproduce the given input vector and analyzing the internals of the network. Oftentimes the network is structured so the hidden layers get progressively smaller, with the smallest layer in the middle, then larger again until the outer layer. This accomplishes the task of data compression and dimensionality reduction. Hopfield [33] introduced Hopfield networks, which used neurons with recursive connections and led to a unique subfield of deep learning. Recurrent neural networks are a generalized category of networks that feature recursive connections and are often used for sequence modeling or learning where the data points have uneven dimensions. Long Short Term Memory (LSTM) networks [34] are more complicated instances of recurrent neural structures that have their own memory cells and provide a way for gradient based training. Attention [35] is a powerful tool used with RNNs for natural language processing, where the network is informed of certain text to give attention to in a textual dataset. Graph neural networks [36] are a recently popular neural network paradigm that allows for the encoding of graph based structures. Aizenberg [37, 38] produced many results investigating complex and multi-valued neurons, and was able to train a single neuron to learn XOR with complex weights and a novel training algorithm. Generative adversarial networks [39] use a game theoretic approach of having a generator network compete against an adversary to generate novel information.

2.2 Posynomials

2.2.1 Monomials, Polynomials, Posynomials

For the following definitions we will consider functions of $\vec{w}$. This is because when training deep learning models we seek to find the value $\vec{w}^0$ that minimizes the objective function $E$. A monomial is a function of the form

$$ M(w) = c \prod_{i=1}^{n} w_i^{x_i} \qquad (2.16) $$

Multiplication is allowed, including exponentiation (repeated multiplication); however, addition of terms is not allowed. Polynomials, however, allow for addition, as shown in Equation 2.17.

$$ P(w) = \sum_{j=1}^{m} c_j \prod_{i=1}^{n} w_i^{x_{i,j}} \qquad (2.17) $$

If we enforce a positivity constraint on the coefficients $c_j$ then we have a posynomial. The literature also occasionally refers to signomials as an extension of polynomials where the exponents can be any real numbers and the variables can assume any nonnegative values.

$$ P^{+}(w) = \sum_{j=1}^{m} c_j \prod_{i=1}^{n} w_i^{x_{i,j}}, \quad \text{s.t. } c_j \ge 0, \; w_i \ge 0 \qquad (2.18) $$

While monomials are closed under multiplication and division, posynomials are closed under addition, multiplication, and nonnegative scaling [40]. A posynomial can be divided or multiplied by a monomial and the result will be a posynomial.
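A short sketch (not thesis code) that evaluates a posynomial of the form in Equation 2.18 given its coefficient vector and exponent matrix; the example function in main is arbitrary.

#include <cmath>
#include <cstdio>
#include <vector>

// Evaluate P(w) = sum_j c_j * prod_i w_i^{x_{j,i}} (Equation 2.18).
// c_j must be nonnegative and each w_i positive; exponents may be any reals.
double posynomial(const std::vector<double>& c,
                  const std::vector<std::vector<double>>& x,  // x[j][i]
                  const std::vector<double>& w) {
    double value = 0.0;
    for (size_t j = 0; j < c.size(); ++j) {
        double term = c[j];
        for (size_t i = 0; i < w.size(); ++i)
            term *= std::pow(w[i], x[j][i]);
        value += term;
    }
    return value;
}

int main() {
    // P(w1, w2) = 2 * w1^0.5 * w2 + 3 * w1^-1 : two terms, two variables.
    std::vector<double> c = {2.0, 3.0};
    std::vector<std::vector<double>> x = {{0.5, 1.0}, {-1.0, 0.0}};
    std::vector<double> w = {4.0, 1.5};
    std::printf("P(4, 1.5) = %.4f\n", posynomial(c, x, w));  // 2*2*1.5 + 3/4 = 6.75
    return 0;
}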

2.2.2 Geometric Programming

Clarence Zener introduced a tool for minimizing a specific class of polynomials, namely posynomials where the number of terms is exactly one more than the number of variables [41]. Richard Duffin provided a duality theory to support this technique and extended it to allow for an arbitrary number of terms [42]. This technique was termed Geometric Programming because of the original use of the arithmetic-geometric inequality, which states that the arithmetic mean is an upper bound to the geometric mean.

$$ \frac{1}{n}(V_1 + V_2 + \ldots + V_n) \ge V_1^{1/n} V_2^{1/n} \cdots V_n^{1/n} \qquad (2.19) $$

By combining repeated terms it can be shown that the inequality holds for weighted means if the weights sum to 1. For a vector $\vec{\delta}$ with $\sum_{j=1}^{m} \delta_j = 1$, the inequality takes the form of Equation 2.20.

$$ \delta_1 V_1 + \delta_2 V_2 + \ldots + \delta_n V_n \ge V_1^{\delta_1} V_2^{\delta_2} \cdots V_n^{\delta_n} \qquad (2.20) $$

Next, we make a change of variables by letting $U_j = \delta_j V_j$.

$$ U_1 + U_2 + \ldots + U_n \ge \left( \frac{U_1}{\delta_1} \right)^{\delta_1} \left( \frac{U_2}{\delta_2} \right)^{\delta_2} \cdots \left( \frac{U_n}{\delta_n} \right)^{\delta_n} \qquad (2.21) $$

Each $U_j$ represents a monomial term of the function of $w$ that we seek to minimize. Minimizing the left side of Equation 2.21 is known as the primal program. By carefully selecting the weights $\vec{\delta}$, the right side of Equation 2.21 can be made completely independent of the variables $w$. This is known as the dual program, and the duality theory of Geometric Programming shows that when the orthogonality conditions are met, the maximum of the dual program is attained and is equal to the minimum of the primal program. This minimum cost can then be used to find the decision variables $\vec{w}^*$. Concretely written, $\vec{\delta}$ must be chosen to satisfy the following normality and orthogonality conditions.

$$ \sum_{j=1}^{m} \delta_j = 1 \qquad (2.22) $$

$$ \sum_{j=1}^{m} \delta_j x_{i,j} = 0, \quad i = 1, 2, \ldots, n \qquad (2.23) $$

Once the decision variables no longer appear on the geometric side of the inequality, the optimal cost can be computed using the term coefficients and dual variables, as shown in Equation 2.24.

$$ \left( \frac{c_1}{\delta_1} \right)^{\delta_1} \left( \frac{c_2}{\delta_2} \right)^{\delta_2} \cdots \left( \frac{c_m}{\delta_m} \right)^{\delta_m} \qquad (2.24) $$

The power of the duality theory of Geometric Programming is to take a nonlinear and possibly nonconvex program and find the solution by instead solving a set of linear equations [43]. When the number of terms is one more than the number of variables, this system is easily solved. When there are additional terms, there are positive degrees of difficulty. It has been shown that optimal solutions can still be found with clever substitution [44]. The theory has also been extended to handle constraints. Posynomial geometric programming with the arithmetic-geometric mean inequality can be applied to any optimization program of the form in Equation 2.25, where there are $m_0$ terms in the objective function, $l$ constraints with $m_k$ terms each, and $n$ decision variables.

$$ \begin{aligned} & g_0(\vec{w}) = \sum_{j=1}^{m_0} c_{0,j} \prod_{i=1}^{n} w_i^{x_{0,j,i}} \\ \text{s.t.} \quad & g_k(\vec{w}) = \sum_{j=1}^{m_k} c_{k,j} \prod_{i=1}^{n} w_i^{x_{k,j,i}} \le 1, \quad k = 1, 2, \ldots, l \\ & c_{k,j} \ge 0 \\ & x_{k,j,i} \ge 0 \end{aligned} \qquad (2.25) $$
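Before moving to the signomial generalization, a small unconstrained example (our own illustration, not drawn from the cited works) shows the dual machinery at work: minimize $g(w) = w + w^{-1}$ over $w > 0$, a posynomial with two terms ($U_1 = w$, $U_2 = w^{-1}$, $c_1 = c_2 = 1$) and one variable, i.e. zero degrees of difficulty.

$$ \begin{aligned} \text{normality:} \quad & \delta_1 + \delta_2 = 1 \\ \text{orthogonality:} \quad & (1)\,\delta_1 + (-1)\,\delta_2 = 0 \;\;\Rightarrow\;\; \delta_1 = \delta_2 = \tfrac{1}{2} \\ \text{dual cost (Equation 2.24):} \quad & \left( \frac{c_1}{\delta_1} \right)^{\delta_1} \left( \frac{c_2}{\delta_2} \right)^{\delta_2} = 2^{1/2} \cdot 2^{1/2} = 2 \end{aligned} $$

which matches the primal minimum $g(1) = 2$ attained at $w^* = 1$.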

Passy and Wilde [45] later generalized geometric programming to signomial geometric programming, which allows for less restrictive constraints and negative terms in the objective function using signum functions $\sigma$ that take on values of $\pm 1$. The form of a signomial geometric program is shown in Equation 2.26. The generalization results in the loss of the applicability of the arithmetic-geometric mean inequality and yields a nonconvex dual program. However, the technique is no worse than other nonconvex optimization techniques and allows for a systematic approach to finding local minima until a global minimum is found. Additional information on both posynomial and signomial geometric programming can be found in [46].

$$ \begin{aligned} & g_0(\vec{w}) = \sum_{j=1}^{m_0} \sigma_{0,j} c_{0,j} \prod_{i=1}^{n} w_i^{x_{0,j,i}} \\ \text{s.t.} \quad & g_k(\vec{w}) = \sum_{j=1}^{m_k} \sigma_{k,j} c_{k,j} \prod_{i=1}^{n} w_i^{x_{k,j,i}} \le \sigma_k, \quad k = 1, 2, \ldots, l \\ & c_{k,j} \ge 0 \\ & x_{k,j,i} \ge 0 \\ & \sigma_{k,j}, \sigma_k = \pm 1 \end{aligned} \qquad (2.26) $$

Additional advancements on geometric programming and polynomial optimization include work by Li and Chang [47], who developed a logarithmic piecewise linearization method for optimizing polynomial programs. Huang and Kao [48] improved this for the case of posynomials and geometric programming problems by reducing the number of variables required. Refer to [49] for many examples of formulating practical problems as geometric programming problems. In addition to providing a fast and computationally easy way to minimize functions, geometric programming has other useful features. Analysis of the dual weights gives an indication of how much each term contributes to the total cost, even independent of the term coefficients. Sensitivity analysis is often used in geometric programming to determine how noise or imperfect data can affect the model. Finally, extensions of geometric programming include techniques such as functional substitution to extend the tool to wider classes of functions.

Chapter 3

Theory

3.1 Posynomial Neural Network

3.1.1 Monomial Based Discriminant Function

A neuron's discriminant is traditionally computed as the inner product of its inputs and its weights. A nonlinear activation function is then applied to the inner product. The main contribution of this work is to define a neuron that does not require a composition of two functions. This creates a neuron that is potentially resistant to the vanishing gradient problem and opens the possibility for alternative training algorithms to gradient descent. Other benefits include a network with improved interpretability. We define the discriminant as a monomial function of the weights where the inputs are the powers to which the weights are raised. A multiplicative bias is also applied.

$$ y_j = c_j \cdot \prod_{\forall i} (w_{i,j})^{x_i}, \quad \text{s.t. } w_{i,j} \ge 1 \qquad (3.1) $$

To avoid negative values raised to odd exponents, a positivity constraint is placed on the weights of a monomial neuron. Combining a layer of monomial neurons with a layer of linear neurons creates a network that models generalized polynomials, or 'posynomials' as the literature often refers to polynomials with such constraints. Here the multiplicative bias $c_j$ is absorbed by the weights of the linear layer, which we denote with $a_{j,k}$ to distinguish them from the weights of the multiplicative layer.

$$ y_k = \sum_{\forall j} a_{j,k} \cdot \prod_{\forall i} (w_{i,j})^{x_i} + b_k, \quad \text{s.t. } w_{i,j} \ge 1 \qquad (3.2) $$

The activation function for each neuron is the identity function. The attentive reader will notice that applying a log transform to Equation 3.1 reduces the neuron to a linear neuron.

$$ \log\left( c \cdot \prod_{\forall i} w_i^{x_i} \right) = \log(c) + \sum_{\forall i} x_i \cdot \log(w_i) $$

At this point the logs that are applied to the weights reduce to other learnable parameters. While not necessarily practical, this transform could be realized through an additional layer or activation function in the network.
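The following sketch illustrates Equations 3.1 and 3.2 directly; it is not the library's Posynomial class, and the weights chosen in main are arbitrary.

#include <cmath>
#include <cstdio>
#include <vector>

// Monomial discriminant of Equation 3.1: y_j = c_j * prod_i w_{i,j}^{x_i},
// with each weight constrained to be at least 1.
double monomial_neuron(const std::vector<double>& w_j, double c_j,
                       const std::vector<double>& x) {
    double y = c_j;
    for (size_t i = 0; i < x.size(); ++i)
        y *= std::pow(w_j[i], x[i]);   // inputs appear as exponents
    return y;
}

// Two-layer posynomial network of Equation 3.2: a layer of monomial neurons
// followed by a linear output neuron (weights a_j and bias b absorb c_j).
double posynomial_network(const std::vector<std::vector<double>>& W,  // W[j][i]
                          const std::vector<double>& a, double b,
                          const std::vector<double>& x) {
    double y = b;
    for (size_t j = 0; j < W.size(); ++j)
        y += a[j] * monomial_neuron(W[j], 1.0, x);
    return y;
}

int main() {
    // Hypothetical weights for two monomial neurons over two inputs.
    std::vector<std::vector<double>> W = {{1.5, 2.0}, {3.0, 1.2}};
    std::vector<double> a = {1.0, -0.5};
    std::vector<double> x = {1.0, 0.0};   // an example input
    std::printf("output = %.4f\n", posynomial_network(W, a, 0.1, x));
    return 0;
}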

3.1.2 Transforming Input Data Before a Posynomial Network

In some cases mapping input to output vectors may not be possible with only posynomials. However, by first applying a transformation, the input vector's components can be scaled such that discrimination via posynomials is possible. The simplest example of this is using a rotation matrix. In two dimensions this is simply

$$ \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$

In three dimensions there are three axes about which a point can be rotated. Thus, the full rotation matrix is the product of three 3x3 rotation matrices [50].

 0          x1 cos(α) − sin(α) 0 cos(β) 0 sin(β) 1 0 0 x1 0 x2 = sin(α) cos(α) 0  0 1 0  0 cos(γ) − sin(γ) x2 = 0 x3 0 0 1 − sin(β) 0 cos(β) 0 sin(γ) cos(γ) x3 cos(α) cos(β) cos(α) sin(β) sin(γ) − sin(α) cos(γ) cos(α) sin(β) cos(γ) + sin(α) sin(γ) sin(α) cos(β) sin(α) sin(β) sin(γ) + cos(α) cos(γ) sin(α) sin(β) cos(γ) − cos(α) sin(γ) − sin(β) cos(β) sin(γ) cos(β) cos(γ)   x1 · x2 (3.3) x3

In the general case, we can generate an n × n rotation matrix based on an n dimensional angle vector by multiplying n individual rotation matrices. See [51] for an algorithm to generate n dimensional rotation matrices.
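A two-dimensional sketch of this transformation (an illustrative helper, not part of the library), applied to the four XOR inputs used in the example that follows:

#include <cmath>
#include <cstdio>

// Rotate a 2D point about the origin by angle alpha (radians), i.e.
// multiply by the 2x2 rotation matrix shown above.
void rotate2d(double alpha, double& x1, double& x2) {
    double r1 = std::cos(alpha) * x1 - std::sin(alpha) * x2;
    double r2 = std::sin(alpha) * x1 + std::cos(alpha) * x2;
    x1 = r1;
    x2 = r2;
}

int main() {
    double pts[4][2] = {{0,0},{0,1},{1,0},{1,1}};
    const double alpha = std::acos(-1.0) / 8.0;   // pi/8
    for (auto& p : pts) {
        double x1 = p[0], x2 = p[1];
        rotate2d(alpha, x1, x2);
        std::printf("(%g, %g) -> (%.4f, %.4f)\n", p[0], p[1], x1, x2);
    }
    return 0;
}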

As an example of how the use of a rotation matrix extends the versatility of posynomial networks, consider applying a $\pi/8$ rotation to the XOR problem.

Figure 3.1: Rotation of XOR data points around the origin by $\pi/8$.

As this aligns the data points on the axes of the basis vectors, it is easy for a posynomial network to learn a saddle point such as

$$ f(\vec{x}) = \left( 0.5^{x_1} + 2^{x_1} \right) - \left( 0.5^{x_2} + 2^{x_2} \right) $$

Figure 3.2: A saddle point that can be learned by a posynomial.

Figure 3.3: The contour plot of a saddle point that can be learned by a posynomial.

Applying the rotation gives the following result.

$$ f(\vec{x}) = \left( 0.5^{\,x_1\cos(\frac{\pi}{8}) - x_2\sin(\frac{\pi}{8})} + 2^{\,x_1\cos(\frac{\pi}{8}) - x_2\sin(\frac{\pi}{8})} \right) - \left( 0.5^{\,x_1\sin(\frac{\pi}{8}) + x_2\cos(\frac{\pi}{8})} + 2^{\,x_1\sin(\frac{\pi}{8}) + x_2\cos(\frac{\pi}{8})} \right) $$

Shown below is the plot of this function and its ability to compute XOR.

Figure 3.4: A saddle point function to compute the XOR data points.

Figure 3.5: A contour plot of a saddle point function that computes the XOR data points.

We can relax the assumption that transformations must be done with a rotation matrix. A general n × n matrix has more flexibility; however, it increases the number of learnable parameters by a power of 2. Further generalization can add biases and change dimensionality, resulting in the equivalent of adding a linear neural network layer with no activation function. However, because of algebraic exponent properties, a separate layer is not necessary. Consider the same XOR function written in a different form:

$$ \left(0.5^{\cos(\frac{\pi}{8})}\right)^{x_1} \left(0.5^{-\sin(\frac{\pi}{8})}\right)^{x_2} + \left(2^{\cos(\frac{\pi}{8})}\right)^{x_1} \left(2^{-\sin(\frac{\pi}{8})}\right)^{x_2} - \left(0.5^{\sin(\frac{\pi}{8})}\right)^{x_1} \left(0.5^{\cos(\frac{\pi}{8})}\right)^{x_2} - \left(2^{\sin(\frac{\pi}{8})}\right)^{x_1} \left(2^{\cos(\frac{\pi}{8})}\right)^{x_2} $$

Thus, adding depth to a posynomial network can often be equivalent to adding breadth, which simplifies gradient descent and is one way that the vanishing gradient problem is avoided. The other is with nonzero derivatives as w → ∞. Computing data transformations this way prior to posynomial computation also preserves the posynomial structure, allowing for alternate training algorithms such as convex or geometric programming.

3.1.3 Derivatives for gradient descent

We will now show the derivatives that are needed for training a posynomial network with gradient descent. We begin with the loss function that we seek to minimize, in this case the L2 norm.

$$ E = \frac{1}{2} \sum_{\forall n} \sum_{\forall j} \left( y_j^{(n)} - \hat{y}_j^{(n)} \right)^2 \qquad (3.4) $$

Here $n$ indexes data points. We would like to adjust each weight $w$ by the derivative of the error with respect to that weight, $\frac{\partial E}{\partial w}$. First we take the derivative of the error with respect to the outputs $y$.

$$ \frac{\partial E}{\partial y} = y - \hat{y} \qquad (3.5) $$

From here we can apply the chain rule,

$$ \frac{\partial E}{\partial w_{i,j}} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{i,j}} \qquad (3.6) $$

where

$$ \frac{\partial y_j}{\partial w_{i,j}} = x_i \cdot (w_{i,j})^{x_i - 1} \cdot c_j \cdot \prod_{\forall k \ne i} (w_{k,j})^{x_k} = x_i \cdot (w_{i,j})^{x_i - 1} \cdot \frac{y_j}{w_{i,j}^{x_i}} = \frac{y_j \cdot x_i}{w_{i,j}} \qquad (3.7) $$

For networks with layers deeper than the monomial layer, it is also necessary to know $\frac{\partial E}{\partial x_i}$, as this becomes the next vector of $\frac{\partial E}{\partial y_j}$ derivatives for the next layer.

$$ \frac{\partial E}{\partial x_i} = \sum_{\forall j} \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i} \qquad (3.8) $$

where

$$ \frac{\partial y_j}{\partial x_i} = y_j \cdot \ln(w_{i,j}) \qquad (3.9) $$
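A sketch of these forward and derivative computations for a single monomial neuron (illustrative code with the neuron index dropped, not the library's Posynomial class):

#include <cmath>
#include <cstdio>
#include <vector>

// Forward value and derivatives of one monomial neuron
// y = c * prod_i w_i^{x_i}  (Equation 3.1 with the neuron index dropped).
struct MonomialGrads {
    double y;                    // neuron output
    std::vector<double> dy_dw;   // Eq. 3.7: dy/dw_i = y * x_i / w_i
    std::vector<double> dy_dx;   // Eq. 3.9: dy/dx_i = y * ln(w_i)
};

MonomialGrads monomial_with_grads(const std::vector<double>& w,
                                  const std::vector<double>& x, double c) {
    MonomialGrads g;
    g.y = c;
    for (size_t i = 0; i < w.size(); ++i) g.y *= std::pow(w[i], x[i]);
    for (size_t i = 0; i < w.size(); ++i) {
        g.dy_dw.push_back(g.y * x[i] / w[i]);
        g.dy_dx.push_back(g.y * std::log(w[i]));
    }
    return g;
}

int main() {
    std::vector<double> w = {2.0, 3.0};   // weights (constrained >= 1)
    std::vector<double> x = {1.0, 2.0};   // inputs appear as exponents
    MonomialGrads g = monomial_with_grads(w, x, 1.0);
    std::printf("y = %.4f\n", g.y);                                   // 2 * 9 = 18
    std::printf("dy/dw = (%.4f, %.4f)\n", g.dy_dw[0], g.dy_dw[1]);    // 9, 12
    std::printf("dy/dx = (%.4f, %.4f)\n", g.dy_dx[0], g.dy_dx[1]);
    return 0;
}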

3.2 Geometric and Convex Programming

The posynomial structure of the computed function can be trained with convex or geometric programming algorithms as an alternative to gradient descent. We will formulate the objective function in terms of the L1 and L2 norms and show additional detail for geometric programming. For convex programming algorithms, see [52].

3.2.1 Formulation

Suppose the L1 norm is chosen as the loss function. We then define our objective function in Equation 3.10, where $d$ indexes sample data points and $k$ indexes output neurons. For example, $\hat{y}_k^{(d)}$ is the target value for the $k$th output neuron for data point $d$.

$$ \begin{aligned} \min_{\vec{w}} \;\; & \sum_{\forall d} \sum_{\forall k} \left( y_k^{(d)} - \hat{y}_k^{(d)} \right) \\ \text{s.t.} \;\; & y_k^{(d)} \ge \hat{y}_k^{(d)} \end{aligned} \qquad (3.10) $$

While an absolute value is normally required for the L1 norm, it is redundant here because the constraints force a strictly nonnegative objective function. While Equation 3.10 gives the true error, the minimizing point $\vec{w}^0$ will be the same if we ignore the constants.

$$ \begin{aligned} \min_{\vec{w}} \;\; & \sum_{\forall d} \sum_{\forall k} y_k^{(d)} \\ \text{s.t.} \;\; & y_k^{(d)} \ge \hat{y}_k^{(d)} \end{aligned} \qquad (3.11) $$

With some optimization algorithms, the constraints must be stated such that a certain quantity is less than or equal to 1. We can use simple algebraic manipulation of the current constraints to reformulate Equation 3.11 as

$$ \begin{aligned} \min_{\vec{w}} \;\; & \sum_{\forall d} \sum_{\forall k} y_k^{(d)} \\ \text{s.t.} \;\; & \frac{\hat{y}_k^{(d)}}{y_k^{(d)}} \le 1 \end{aligned} \qquad (3.12) $$

This is a polynomial with a term for each output neuron for each data point in the training set. Note that this requires each output to be at least as large as its target, so that error can only occur in one direction from the desired values. A better fitting set of parameters may be learned if the L2 norm is instead minimized, which would allow error in either direction.

$$ \min_{\vec{w}} \sum_{\forall d} \sum_{\forall k} \left( y_k^{(d)} - \hat{y}_k^{(d)} \right)^2 \qquad (3.13) $$

This minimizes the sum of squared differences between the output values and their targets. Expanding the squared terms allows us to represent the objective function strictly as a polynomial and ignore any constants, which do not affect the minimizing point.

$$ \min_{\vec{w}} \sum_{\forall d} \sum_{\forall k} \left( (y_k^{(d)})^2 - 2 y_k^{(d)} \hat{y}_k^{(d)} \right) \qquad (3.14) $$

The next component needed when formulating the training of a posynomial neural network is the addition of signum functions. Note that this technically makes it a signomial neural network. Consider a posynomial layer followed by a linear layer, with the weights of the linear layer determined to be either positive or negative. This can be done with a preset signum function $\sigma_{k,j}$ to accompany each linear weight and bias. Combining the signum functions with Equations 3.2 and 3.14, we can define the signomial geometric program.

$$ \begin{aligned} \min_{\vec{w}, \vec{b}} \;\; E_{\vec{x}, \hat{\vec{y}}}\left( \vec{w}, \vec{b} \right) = \sum_{\forall d} \sum_{\forall k} \Bigg[ & \left( \sum_{\forall j} \sigma_{k,j} a_{k,j} \prod_{\forall i} w_{j,i}^{x_i} + \sigma_{k,0} b_k \right)^2 \\ & - 2 \hat{y}_k^{(d)} \left( \sum_{\forall j} \sigma_{k,j} a_{k,j} \prod_{\forall i} w_{j,i}^{x_i} + \sigma_{k,0} b_k \right) \Bigg] \\ \text{s.t.} \quad & a_{k,j} \ge 0, \quad w_{j,i} \ge 0, \quad b_k \ge 0, \quad \sigma_{k,j} = \pm 1 \end{aligned} \qquad (3.15) $$

Learning representations with geometric programming could have many benefits, the most obvious being possible speed improvements over gradient descent, as geometric programming provides a systematic way of finding minima. Additionally, analysis of the dual variables could help deep learning practitioners understand data sets because the dual variables act as weights for the terms of the objective function, which are directly tied to specific data points in the dataset. Understanding why some data points might be weighted more than others could provide insight on which data points are more useful than others when creating better, more concise data sets. This weight analysis could also help the security of neural nets by creating more uniform data sets where the dual variables have little variance, which could possibly indicate that the data points have uniform effect on the learned representation.

Chapter 4

Implementation

4.1 Data Structures

The following data structures were implemented in C++ as template classes for most data types. As template classes, they are each contained within .h header files and can be included by other header files or .cpp files. Compilation was done with GNU GCC compilers [53].

4.1.1 Tree Although many libraries currently exist that implement trees, a specialized binary tree class was designed for the required functionality of the tensor memory tracking data structure to keep the library as self contained as possible. The tree template class contains two template types, K for the key type and D for the data type. The key is used for indexing and search, while the data holds whatever information must be stored by the node. The Node is a nested class within the Tree class, also of template types K and D. A Tree contains a pointer to the root node, and each Node contains pointers to left and right nodes as well as a key and data. Nodes can be added or deleted, and fetched by either key or data. Additionally, a function was implemented to retrieve the biggest key less than or equal to the given key. This was useful for tracking memory allocations in the Tensor class.

4.1.2 List A doubly linked list template class was also implemented for self containment to be used by the Network class. The class contains two pointers to the head and tail Nodes as well as a counter for the number of nodes. The nested Node template class contains pointers to the next and previous Nodes as well as a data value of the template type. Functions were primarily written for stack and queue functionality, adding, removing, or fetching nodes at the head or tail. Fetch functions return the Node for traversal by the user.

4.1.3 Tensor A lightweight Tensor library was implemented to have low metadata overhead, provide tensors of arbitrary dimensions, and easy indexing with sequenced [] notations. The Tensor template class includes only 3 members, a number of dimensions, a pointer to an array storing the size of each dimension, and a pointer to the tensor elements. A useful syntactic ability of the tensor is its subscripting. Subscripting an n-dimensional tensor will return an (n − 1)-dimension tensor. The only quirk with maintaining generalizability is that indexing a one-dimensional tensor (a list) will return a 0-dimensional element with a single element, and not the element itself. To make accessing individual elements easier, the ∼ operator is used to retrieve the element of a single-element tensor, or simply the first element of any tensor. For example, to read and write elements in a list using Tensors, 1|Tensor mytensor = Tensor(1, 4); // List of 4 ints 2|˜mytensor[2] = 3; // Assign the third element to 3. 3|int x = ˜mytensor[3]; // Get the third element. 23

Tensor malloc tracker

When a new Tensor is created, space is allocated on the heap. When Tensors are subscripted off the parent Tensor, pointers are simply made into the same chunk of memory. As it is unknown whether parent or child will be deleted first, it can be unclear when or if to free a Tensor's memory, or what the base address of the allocated chunk was. To navigate these issues, allocated chunks are tracked in a Tree data structure. Two global Trees are shared by all Tensors: one for the dimensions arrays and one for tensor storage. Upon allocation of a new chunk, a node is added to the Tree with the key being the allocated address and the data being a counter initialized to 1. When Tensors are subscripted using the same memory chunk, the Node is retrieved by finding the largest key less than or equal to the given pointer value, and the data counter is incremented. When a Tensor's destructor is called, the same procedure is used for retrieving the Node, but the counter is decremented. If the counter reaches 0, there are no references to the corresponding memory chunk and the base address can be freed.
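The same reference-counting idea can be sketched with a std::map ordered by base address; the thesis library uses its own Tree class, so the following is only an illustration of the bookkeeping, and it assumes every queried pointer lies inside a tracked chunk.

#include <cstdio>
#include <cstdlib>
#include <map>

// Reference counts keyed by the base address of each allocated chunk.
static std::map<char*, int> g_chunks;

// Find the chunk whose base address is the largest key <= p, i.e. the
// allocation that contains the (possibly interior) pointer p.
std::map<char*, int>::iterator find_chunk(void* p) {
    auto it = g_chunks.upper_bound(static_cast<char*>(p));
    return it == g_chunks.begin() ? g_chunks.end() : --it;
}

void* tracked_alloc(size_t bytes) {
    char* base = static_cast<char*>(std::malloc(bytes));
    g_chunks[base] = 1;                 // a new chunk starts with one reference
    return base;
}

void add_reference(void* p) { ++find_chunk(p)->second; }

void drop_reference(void* p) {
    auto it = find_chunk(p);
    if (--it->second == 0) {            // last reference: free the base address
        std::free(it->first);
        g_chunks.erase(it);
    }
}

int main() {
    double* chunk = static_cast<double*>(tracked_alloc(16 * sizeof(double)));
    double* slice = chunk + 4;          // "subscripted" view into the same chunk
    add_reference(slice);
    drop_reference(chunk);              // parent destroyed first: chunk survives
    std::printf("live chunks: %zu\n", g_chunks.size());   // 1
    drop_reference(slice);              // last view destroyed: chunk freed
    std::printf("live chunks: %zu\n", g_chunks.size());   // 0
    return 0;
}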

An additional header file, tensorutils.h, is included to deal with Tensors. This file mainly includes I/O functions, such as functions for converting IDX files to SDF files. IDX is a data format for storing arrangements of data in an arbitrary number of dimensions. The format of the file is shown below [54].

32-bit integer    magic number
32-bit integer    size of dimension 1
32-bit integer    size of dimension 2
...
32-bit integer    size of dimension n
data

The magic number encodes information such as the data type of the elements and the number of dimensions. While IDX files are commonly used for compactly storing data sets like MNIST or Fashion MNIST, the data is arranged in big endian byte order. For better compatibility with x86 CPUs, a data format SDF was created to accomplish similar goals to the IDX format but tailored specifically to the Tensor library; it stores data in little endian format for faster and easier reading and writing. The format of SDF is very similar to IDX, however the header is populated with a size_t sized magic number, followed by size_t sized numbers for the number of dimensions and the size of each dimension. The data is then written as a contiguous chunk of bytes.

4.1.4 Layer

Layers are the building blocks of neural networks. The layer.h file describes an interface with virtual functions that implementing classes must implement. Layer classes have input_size and output_size variables, and five Tensor pointers. An input_vector pointer is a reference to where the input is accessed. The weight, output, weight_der, and backpropagation_der pointers have obvious functionality; allocating the memory they point to is the Layer's responsibility. Layer classes must also implement the virtual functions, mainly the y discriminant function, the z activation function, the dzdy derivative of the activation function, the dydw derivative used to update the weights, and the dydx derivative used to backpropagate errors.

There are currently three classes that implement the Layer interface. The Linear class has no activation function and a discriminant that computes a vector of inner products of the input vector with each row vector in the weight matrix. The derivative functions are programmed accordingly. The Sigmoid class has the same discriminant as the Linear class but with a sigmoidal activation function. The Posynomial class computes Equation 3.1 as its discriminant function and has no activation function. The reader should note that this is a misnomer, as a Posynomial layer really only computes a layer of monomials; however, in practice a Posynomial layer was always followed by a Linear layer, which essentially computes a vector of posynomials. The name Posynomial was also kept because of the positive weight constraints that avoid negative bases raised to odd exponents.
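A declaration-only sketch of what such an interface might look like; the member and function names follow the text above, but the exact declarations in layer.h may differ.

#include <cstddef>

// Forward declaration standing in for the thesis's Tensor template class.
template <typename T> class Tensor;

// Sketch of a Layer interface in the spirit of layer.h: concrete layers
// such as Linear, Sigmoid, and Posynomial override the virtual hooks.
template <typename T>
class Layer {
public:
    virtual ~Layer() {}

    virtual void y() = 0;      // discriminant function
    virtual void z() = 0;      // activation function
    virtual void dzdy() = 0;   // derivative of activation w.r.t. discriminant
    virtual void dydw() = 0;   // derivative used to update the weights
    virtual void dydx() = 0;   // derivative used to backpropagate error

protected:
    size_t input_size = 0;
    size_t output_size = 0;

    // Ownership as described in the text: the input is borrowed, the
    // remaining tensors are allocated and owned by the layer.
    Tensor<T>* input_vector = nullptr;
    Tensor<T>* weight = nullptr;
    Tensor<T>* output = nullptr;
    Tensor<T>* weight_der = nullptr;
    Tensor<T>* backpropagation_der = nullptr;
};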

Utility functions for Layers are also supplied in the layerutils.h file. This includes functions that are not class methods but are still relevant to operations involving Layers, such as initializing the weight matrices. Initialization functions were included to perform constant weight initialization for a given template type and template parameter (this function is mainly for debugging purposes) and uniform weight initialization for a given template type and template parameters giving the bounds of the uniform distribution. It should be noted that currently the bounds must be integers, as template parameters in C++ do not support floating point values; compilation becomes ambiguous when deciding whether to generate code depending on whether two floating point parameters are equal.

4.1.5 Network

The Network class is the glue code that stitches together the components of a neural network into one coherent object. The only member is a List of Layers, which can include a combination of different layer types. Methods are provided for adding layers, inferencing, learning, training (learning over entire data sets), and testing.

A UML diagram of the provided data structures is shown in Figure 4.1.

Figure 4.1: UML class diagram of implemented classes

4.2 Algorithms

The Network and Layer classes build upon the specialized data structures to provide interfaces for the user to train a tensor of weights given tensors of inputs and outputs, and to predict on new data. This allows a user to easily train on data sets and predict new data with very little code and without a deep understanding of the data structures provided by the C++ standard template library or by this project.

4.2.1 Inference

Both the Layer and Network classes implement an inference function. At the Layer level, a reference to an input tensor-vector is given along with a boolean value indicating whether to compute the derivatives (used in training mode). Then the following algorithm is run.

Algorithm 1 Inference Algorithm - Layer
 1: procedure INFERENCE(*input_layer, compute_derivatives)
 2:     this->input_layer ← input_layer
 3:     y()
 4:     if compute_derivatives then
 5:         dydx()
 6:         dydw()
 7:     z()
 8:     if compute_derivatives then
 9:         dzdy()
10:     return this->output_layer

Here, the functions y(), z(), and the derivative functions dydw(), dydx(), and dzdy() are all virtual functions in the Layer class to be implemented by specific instances. Developers can easily expand this library by using the simple interface, or even make more complicated layers by overriding the inference function altogether. The y() and z() functions are the discriminant and activation, respectively. The derivative functions compute the corresponding derivatives that are chained together during backpropagation, explained in the next section. It is the responsibility of the developer to write derivative functions that correctly implement the derivatives of the described feedforward functions. If inference is being called purely for prediction without updates, the boolean parameter can be set to false to avoid unnecessary work.

Inference at the Network level is very simple. The parameters are identical to those of a single layer. As the Layers of the Network are stored as references in a list, the list is traversed: inference is called for each Layer (with the initial input given to the first), and the reference to the output of each Layer is passed as the input of the next Layer. The output of the final Layer is the predicted vector.

4.2.2 Train/Learn

The training and learning algorithms allow weight tensors for each layer of the network to be learned given a pair of input and output tensors. The train function for the Network class simply calls the learn function repeatedly for each data point in the given dataset (thus training for one epoch). The train function can then be called repeatedly until a minimum error is achieved. Error can be measured with the test function, which takes an input tensor-matrix and output tensor-matrix representing the test dataset and test labels. The error is returned.

The learn function for the Network class is essentially the reverse of the inference function. The input is the network error as a vector, and the Layer list is traversed backwards, with each error being passed to the Layer learn function. The outputted error is backpropagated much like how the activations are passed forward during inference.

At the Layer level there are two parts to the learn function. First, the weights of the current layer must be updated according to the error. A doubly nested loop is used to access each element of the weight tensor-matrix. Each weight is modified by subtracting the corresponding weight derivative (calculated during the most recent inference in dydw and dzdy) multiplied by the output error dEdz passed to the function; a multiplicative learning rate is also applied. The second part of the Layer level learn function involves computing the backpropagated error for the previous Layer. The tensor-vector *backpropagation_der, which was partially computed during dydx and dzdy, is finalized with an elementwise multiplication by the sum of the elements of the error. This becomes the new error that is passed back to the previous Layers.
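The two parts of the Layer-level learn step can be sketched as follows; the class and member names here are illustrative stand-ins, not the library's actual declarations.

#include <cstdio>
#include <vector>

// Sketch of the Layer-level learn step described above: every weight is
// nudged against its error derivative scaled by the learning rate, and the
// backpropagated error for the previous layer is scaled elementwise.
struct ToyLayer {
    std::vector<std::vector<double>> weight;      // weight[j][i]
    std::vector<std::vector<double>> weight_der;  // dy_j/dw_{j,i} from inference
    std::vector<double> backprop_der;             // partial dE/dx_i from inference

    // dEdz holds the error passed back from the next layer (one per output).
    std::vector<double> learn(const std::vector<double>& dEdz, double rate) {
        for (size_t j = 0; j < weight.size(); ++j)
            for (size_t i = 0; i < weight[j].size(); ++i)
                weight[j][i] -= rate * dEdz[j] * weight_der[j][i];

        double error_sum = 0.0;
        for (double e : dEdz) error_sum += e;

        std::vector<double> prev_error(backprop_der.size());
        for (size_t i = 0; i < backprop_der.size(); ++i)
            prev_error[i] = backprop_der[i] * error_sum;   // error for previous layer
        return prev_error;
    }
};

int main() {
    // One output neuron, two inputs; all values are made up for the example.
    ToyLayer layer{{{0.5, 0.5}}, {{1.0, 2.0}}, {0.3, 0.7}};
    std::vector<double> next = layer.learn({0.1}, 0.01);
    std::printf("w = (%.4f, %.4f), backprop = (%.4f, %.4f)\n",
                layer.weight[0][0], layer.weight[0][1], next[0], next[1]);
    return 0;
}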

4.3 Verification and Bug Finding

The code was compiled with the GNU compiler for C++, g++ [53]. Test programs were written to exercise all data structures and methods. Compilation was done at multiple levels of optimization provided by g++, as different levels revealed errors that others did not, such as bad memory references leading to segmentation faults.

4.3.1 Eradicating Memory Leaks with Valgrind

Valgrind [55, 56, 57] is a framework for dynamic analysis of software. It can be used to analyze the address space of a program to detect memory leaks, memory management bugs, and threading bugs. Several example programs were created using the tensor and neural network libraries. These example programs include inferencing and training of neural networks, tensor reading, writing, and operations, and smart pointer tracking. Valgrind reported memory leaks, which were subsequently removed by ensuring that any use of malloc or new corresponded to a single call to free or delete, respectively.

Figure 4.2: Valgrind successfully verified that memory leaks are nonexistent in the written software.

4.3.2 Input Fuzzing with American Fuzzy Lop

Fuzz testing involves repeatedly supplying randomized inputs to a program in an attempt to uncover bugs with nontrivial inputs that crash the program. While black box fuzzing is completely randomized and unintuitive, coverage-based fuzzing or grey-box fuzzing involves analyzing the edges of the execution graph and uses intuitive guesses that try to cover all edges of the program. American Fuzzy Lop (AFL) [58] uses an initial template as the format of the input, then employs a genetic algorithm and its own instrumenting C and C++ compiler wrappers to uncover new program edges.

A program was written that takes an input tensor file and passes it to a randomized network of linear and posynomial layers. The random tensor file is supplied by the AFL fuzzer.

Figure 4.3: American Fuzzy Lop fuzzing the libraries to identify bugs.

Bugs were identified regarding input sanitization and bounds checking, among others. Many were fixed; some fixes are still in development.

Chapter 5

Experimental Results

5.1 XOR With 3 Layer Network

The first experiment we will present in this thesis uses the implemented library to learn the XOR problem with linear and posynomial layers. A network was defined with a linear layer, followed by a posynomial layer, with a final linear output layer. A diagram of this network is shown in Figure 5.1.

Figure 5.1: Diagram of 3 layer XOR network.

This was trained by initializing the linear layers with a uniform random distribution over the interval [0, 1], and the posynomial layer with a uniform random distribution over the interval [1, 2] so as to avoid weights near 0. The learning rate was 0.008. The reader should note that intuition about learning rates does not transfer from sigmoidal networks to posynomial networks, as the derivatives are different. The initial weights are shown in Figure 5.2.

Figure 5.2: Initial weights of 3 layer XOR network.

After 27,000 epochs the outputs of the network can be considered correct, as they would yield the correct integer if a nearest-integer classification scheme is used.

Figure 5.3: 3 layer XOR network showing correct outputs.

After around 33,000 epochs the output of the network is near perfect when compared to the desired values.

Figure 5.4: Near perfect outputs for 3 layer XOR network.

The first layer computes the following transformation on the inputs.

Figure 5.5: Transformation of inputs by first layer in 3 layer XOR network.

The posynomial layers then produce the correct outputs.

Figure 5.6: Posynomial layers of 3 layer XOR network.

Figure 5.7: Contour plot of posynomial layers of 3 layer XOR network.

After the transformation, the posynomial layers are successfully able to discriminate between equivalent and nonequivalent inputs. Combining all three layers gives the following function.

Figure 5.8: Plot of 3 layer XOR network.

Figure 5.9: Contour plot of 3 layer XOR network.

At this point the ability of a 3 layer posynomial network to discriminate nonlinear data sets should be apparent.

5.2 XOR with 2 Layer Network

The depth of the 3 layer network can be represented with additional breadth in a 2 layer network. We trained a network with a single hidden layer of 4 posynomial neurons followed by a single layer of linear neurons.

Figure 5.10: Diagram of 2 layer XOR network.

The same initialization distributions and ranges were used as before. The learning rate was also 0.008.

Figure 5.11: Initial weights of 2 layer XOR network.

With fewer layers the network trained much faster, achieving correct results after fewer than 1,000 epochs.

Figure 5.12: Correct outputs for 2 layer XOR network.

After 2,000 epochs perfect results are achieved.

Figure 5.13: Perfect outputs for 2 layer XOR network.

Converting depth to breadth in a posynomial network is therefore shown to lead to significantly faster training times for an identical polynomial structure. The plots of the learned function are shown below.

Figure 5.14: Plot of 2 layer XOR network.

Figure 5.15: Contour plot of 2 layer XOR network.

5.3 Training XOR using an objective function and a nonlinear solver.

An XOR network was trained with a nonlinear solver as an alternative to gradient descent. First, an objective function was defined based on the 3 layer XOR network described earlier. Any redundant weights were merged for brevity. The objective function was defined in MATLAB and fmincon was used to minimize it. The objective function is shown below.

$$ \begin{aligned} \min_{\vec{w}} \;\; & \left( w_1 \cdot w_4^{w_8 \cdot 0 + w_9 \cdot 0} \cdot w_5^{w_{10} \cdot 0 + w_{11} \cdot 0} + w_2 \cdot w_6^{w_8 \cdot 0 + w_9 \cdot 0} \cdot w_7^{w_{10} \cdot 0 + w_{11} \cdot 0} + w_3 \right)^2 \\ & + \left( w_1 \cdot w_4^{w_8 \cdot 0 + w_9 \cdot 1} \cdot w_5^{w_{10} \cdot 0 + w_{11} \cdot 1} + w_2 \cdot w_6^{w_8 \cdot 0 + w_9 \cdot 1} \cdot w_7^{w_{10} \cdot 0 + w_{11} \cdot 1} + w_3 - 1 \right)^2 \\ & + \left( w_1 \cdot w_4^{w_8 \cdot 1 + w_9 \cdot 0} \cdot w_5^{w_{10} \cdot 1 + w_{11} \cdot 0} + w_2 \cdot w_6^{w_8 \cdot 1 + w_9 \cdot 0} \cdot w_7^{w_{10} \cdot 1 + w_{11} \cdot 0} + w_3 - 1 \right)^2 \\ & + \left( w_1 \cdot w_4^{w_8 \cdot 1 + w_9 \cdot 1} \cdot w_5^{w_{10} \cdot 1 + w_{11} \cdot 1} + w_2 \cdot w_6^{w_8 \cdot 1 + w_9 \cdot 1} \cdot w_7^{w_{10} \cdot 1 + w_{11} \cdot 1} + w_3 \right)^2 \end{aligned} \qquad (5.1) $$

After 3000 iterations the weights were produced in MATLAB. Visualizing the learned function shows a correct XOR classifier.

Figure 5.16: Posynomial XOR function trained with MATLAB's fmincon function.

Figure 5.17: Contour plot of posynomial XOR function trained with MATLAB's fmincon function.

Posynomial networks are unique in that they represent sums of products of exponentials of the inputs $\vec{x}$; however, learning the weight matrices amounts to minimizing a polynomial objective function. This mitigates the vanishing gradient problem when training with gradient descent, because the derivatives do not generally flatten and because depth can instead be represented as breadth. Additionally, the polynomial objective functions can be minimized with nonlinear solvers and convex or geometric programming algorithms, opening up new possibilities in deep learning.
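As a hedged illustration of this point, the objective of Equation (5.1) can be re-expressed in ordinary code and handed to any general-purpose nonlinear solver. The sketch below is not the MATLAB code used in this work; it is a C++ re-expression with hypothetical naming, storing the merged weights 0-indexed as w[0] = w1 through w[10] = w11.

\begin{verbatim}
#include <cmath>
#include <vector>

// Squared-error objective for the 3 layer posynomial XOR network of
// Equation (5.1). Illustrative re-expression only; the experiment in
// this section minimized the same objective with MATLAB's fmincon.
double xor_objective(const std::vector<double>& w) {
    // Network output for inputs (x1, x2): linear -> posynomial -> linear.
    auto f = [&](double x1, double x2) {
        const double z1 = w[7] * x1 + w[8]  * x2;                   // w8*x1 + w9*x2
        const double z2 = w[9] * x1 + w[10] * x2;                   // w10*x1 + w11*x2
        const double p1 = std::pow(w[3], z1) * std::pow(w[4], z2);  // w4^z1 * w5^z2
        const double p2 = std::pow(w[5], z1) * std::pow(w[6], z2);  // w6^z1 * w7^z2
        return w[0] * p1 + w[1] * p2 + w[2];                        // w1*p1 + w2*p2 + w3
    };
    // Sum of squared errors over the four XOR training points.
    const double e00 = f(0, 0) - 0, e01 = f(0, 1) - 1,
                 e10 = f(1, 0) - 1, e11 = f(1, 1) - 0;
    return e00 * e00 + e01 * e01 + e10 * e10 + e11 * e11;
}
\end{verbatim}

Written this way, the same objective could be passed to a C++ nonlinear optimization routine, while the experiment above used MATLAB's fmincon.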

5.4 Run time Analysis

5.4.1 Run time of Dot Product

To evaluate the Tensor data structure's user-facing capabilities and built-in algorithms, the CPU time to compute the dot product of two vectors was measured. This was compared for the library's supplied dot product method and a dot product implementation written on top of the Tensor library's overloaded subscripting operators, which is most likely how a user would compute the dot product in the absence of an optimized function. Additionally, these results were compared to code written to compute the dot product of two malloc'd arrays.

Figure 5.18: Run time of dot product algorithms in milliseconds.

The library's built-in dot product performs as well as an array-based dot product. However, the subscripting operator incurs significant overhead by performing a function call instead of a single memory access, resulting in significantly higher run times. While Tensor subscripting provides convenience and safety, using the algorithms supplied by the library will yield better performance. Recommended uses of Tensor subscripting include slicing lower dimensional Tensors out of higher dimensional Tensors, or occasional accesses that will not affect algorithmic complexity. Tensor algorithms not supplied by the library can either be added to the header file, or the underlying memory allocation can be accessed and modified like a normal malloc'd chunk of memory with negligible overhead. Benefits of using the Tensor library include bounds checking and other safety features, already implemented algorithms, and syntactic convenience.
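The overhead described above can be illustrated with a small, self-contained benchmark. The CheckedVector class below is a hypothetical stand-in for a bounds-checked subscripting wrapper, not the thesis's Tensor class; only the relative cost of per-element function calls versus raw array accesses is of interest.

\begin{verbatim}
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Toy stand-in for a bounds-checked subscripting wrapper (illustrative only).
struct CheckedVector {
    std::vector<double> data;
    double operator[](std::size_t i) const {
        if (i >= data.size()) throw std::out_of_range("index");
        return data[i];
    }
};

int main() {
    const std::size_t n = 1u << 22;
    std::vector<double> a(n, 1.5), b(n, 2.0);
    CheckedVector ca{a}, cb{b};

    auto time_it = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = fn();   // keep the result so the loop is not optimized away
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    double t_raw = time_it([&] {       // plain array-style accesses
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    });
    double t_checked = time_it([&] {   // per-element function calls with bounds checks
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += ca[i] * cb[i];
        return s;
    });
    std::printf("raw: %.2f ms, checked subscript: %.2f ms\n", t_raw, t_checked);
    return 0;
}
\end{verbatim}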

Chapter 6

Conclusion

6.1 Discussion

In this work, we developed the theory for learning representations of data with posynomial structured networks. We derived the weight update equations for gradient descent on posynomial layers and proposed alternate optimization algorithms such as geometric programming or general convex programming. Next, we implemented a lightweight, flexible, high performance neural network library that supports both sigmoidal and posynomial networks and is easily expandable to allow for additional layers and algorithms. Unit tests were used to verify the functionality of the library, Valgrind was used to assess the memory management, and fuzzing analysis was done with American Fuzzy Lop. We showed that XOR can successfully be trained on a 3 layer posynomial network with bounded width of 2, and that depth in a posynomial network can instead be transformed into additional breadth. As the derivatives of posynomial networks do not generally flatten, and because depth can be represented as additional width in a single hidden layer, posynomial networks are resistant to the vanishing gradient problem when gradient descent is the chosen optimization algorithm. Finally, we used MATLAB to show that nonlinear solvers can be used to rapidly train a network by representing the network error as a polynomial objective function.

6.2 Future Directions

6.2.1 Alternative to Gradient Descent

Additional research could be done to implement polynomial programming algorithms for learning the parameters of these networks. Specifically, geometric programming was investigated for this work; however, additional work must be completed to circumvent the high degrees of difficulty and the limits on the number of parameters that geometric programming introduces. A possible solution is to finish defining a divide-and-conquer algorithm for geometric programming that would address these problems and also allow data sets to be trained iteratively, so that new data sets can be added without retraining the entire network.

6.2.2 Develop theory and training algorithm for complex value neural networks

A two layer, posynomial-linear network that allows for complex valued weights could be trained to learn certain transforms such as the Fourier transform. While this would be inefficient, as all weights would train to the same thing, arbitrary transforms that can be expressed as summations of complex valued polynomials could be learned on the data. Additionally, investigation could be done on training complex polynomial networks to compute the inputs' Fourier transforms. Perhaps these networks could be studied in ways similar to autoencoders, where the internal structure is studied for generating various encodings of input data or for generative methods.

6.2.3 Layer analogous to the convolutional layer

It is unclear what the analogous convolutional operator would be for posynomial networks. Perhaps convolution could be used prior to passing to posynomial-linear layer combinations instead of a sigmoidal layer. Alternatively, nonlinear filtering layers could be developed; [59] used nonlinear convolutional filters, including Volterra based convolution. We believe that by modifying a posynomial layer to perform a convolution-like operation, then using recurrent connections to iteratively perform convolution while passing each iteration out to a linear layer that sums the results, a Volterra series may be emulated. We leave determining how to train such a network, as well as its practicality, to future work.

6.2.4 Library improvements

The library, while functional, is far from complete. The gradient descent algorithm is still basic, and could benefit from more advanced techniques such as momentum and the other methods discussed in the literature review. Additionally, most modern deep learning is done with the aid of hardware accelerators to exploit the inherent parallelism of training a network. CUDA could be used for interfacing with Nvidia graphics cards, or alternatively more general libraries like OpenCL or OpenACC could be used with graphics cards, FPGAs, multicore processors, and even ASICs.

Bibliography

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] He Li, Kaoru Ota, and Mianxiong Dong. Learning IoT in edge: Deep learning for the internet of things with edge computing. IEEE Network, 32:96–101, January 2018.

[3] Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials, 20:2923–2960, 2018.

[4] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S Chung. Toward accelerating deep learning at scale using specialized hardware in the datacenter. In 2015 IEEE Hot Chips 27 Symposium (HCS), pages 1–28, 2015.

[5] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017.

[6] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps, pages 323–350. Springer, 2018.

[7] Xueheng Qiu, Le Zhang, Ye Ren, Ponnuthurai N Suganthan, and Gehan Amaratunga. Ensemble deep learning for regression and time series forecasting. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), pages 1–6. IEEE, 2014.

[8] Brody Huval, Tao Wang, Sameep Tandon, Jeff Kiske, Will Song, Joel Pazhayampallil, Mykhaylo Andriluka, Pranav Rajpurkar, Toki Migimatsu, Royce Cheng-Yue, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.

[9] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.

[10] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug discovery today, 23(6):1241–1250, 2018.

[11] Erik Gawehn, Jan A Hiss, and Gisbert Schneider. Deep learning in drug discovery. Molecular informatics, 35(1):3–14, 2016.

[12] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.

[13] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[14] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[15] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[16] Jerome Y Lettvin, Humberto R Maturana, Warren S McCulloch, and Walter H Pitts. What the frog’s eye tells the frog’s brain. Proceedings of the IRE, 47(11):1940–1951, 1959.

[17] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2):215–232, 1958.

[18] Sandro Skansi. Introduction to Deep Learning. Springer International Publishing, 2018.

[19] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.

[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.

[21] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, Oct 1986.

[22] Léon Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.

[23] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[24] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.

[25] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.

[26] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[27] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Adaptive learning rates for each connection, 2012.

[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[29] Timothy Dozat. Incorporating nesterov momentum into adam. 2016.

[30] Mykel J Kochenderfer and Tim A Wheeler. Algorithms for optimization. MIT Press, 2019.

[31] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[32] David H Hubel and Torsten N Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1):215–243, 1968.

[33] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.

[34] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[36] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

[37] Igor Aizenberg. Solving the XOR and parity n problems using a single universal binary neuron. Soft Computing, 12:215–222, February 2008.

[38] Igor Aizenberg, Naum N Aizenberg, and Joos PL Vandewalle. Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media, 2013.

[39] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 4(5):6, 2014.

[40] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[41] Clarence Zener. A mathematical aid in optimizing engineering designs. Proceedings of the National Academy of Sciences of the United States of America, 47(4):537, 1961.

[42] Richard J Duffin. Dual programs and minimum cost. Journal of the Society for Industrial and Applied Mathematics, 10(1):119–123, 1962.

[43] Richard J. Duffin, Elmor L. Peterson, and Clarence Zener. Geometric Programming. John Wiley & Sons, Inc., 1967.

[44] Charles S Beightler, Don T Phillips, and Douglass J Wilde. Foundations of optimization. Prentice-Hall, 1979.

[45] U. Passy and D. J. Wilde. Generalized polynomial optimization. SIAM Journal on Applied Mathematics, 15, September 1967.

[46] Charles S. Beightler and Don T. Phillips. Applied Geometric Programming. John Wiley & Sons, Inc, 1976.

[47] Han-Lin Li and Ching-Ter Chang. An approximate approach of global optimization for polynomial programming problems. European Journal of Operational Research, 107:625–632, June 1998.

[48] Chia-Hui Huang and Han-Ying Kao. An effective linear approximation method for geometric programming problems. In 2009 IEEE International Conference on Industrial Engineering and Engineering Management. IEEE, December 2009.

[49] Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8:67, April 2007.

[50] A. Zee. Group Theory in a Nutshell for Physicists. Princeton University Press, 2016.

[51] Ognyan Ivanov Zhelezov. N-dimensional rotation matrix generation algorithm. American Journal of Computational and Applied Mathematics, 7:51–57, 2017.

[52] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[53] Free Software Foundation, Inc. GNU GCC compiler.

[54] Yann LeCun. The MNIST database of handwritten digits.

[55] Nicholas Nethercote and Julian Seward. Valgrind.

[56] Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. Electronic Notes in Theoretical Computer Science, 89, 2003.

[57] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), June 2007.

[58] M. Zalewski. American fuzzy lop, 2015.

[59] Georgios Zoumpourlis, Alexandros Doumanoglou, Nicholas Vretos, and Petros Daras. Nonlinear convolution filters for CNN-based learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 4761–4769, 2017.

Academic Vita

Steven Petrone

Objective: To pursue a career or graduate school in computer engineering.

Education: Bachelor of Science in Computer Engineering, June 2017 - May 2020, summa cum laude. Cybersecurity Computational Foundations Minor. The Pennsylvania State University, University Park, PA. The Schreyer Honors College. Thesis: Nonlinear Neuron Discriminant Functions for Alternate Deep Learning Training Algorithms

Relevant Courses: Computer Architecture, Operating Systems, Data Structures and Algorithms, Computer Vision, Digital Image Processing, Computer & Network Security, Wireless Security, Deep Learning, Software Security

Work Experience:
Undergraduate Research Intern, Advanced Technology Laboratory, Lockheed Martin, Arlington, VA, May - August 2019
- Wrote proprietary software using C, Python and databases
Undergraduate Research Intern, Penn State Applied Research Laboratory, State College, PA, April 2018 - Present
- Wrote software using Java, Python, and Javascript
- Accepted publication: T. Karn, S. Petrone and C. Griffin. Modeling a Hidden Dynamical System Using Energy Minimization and Kernel Density Estimates. Phys. Rev. E, Accepted Oct., 2019. (In Press).

Technical Skills:
High Level Languages: C, C++, Python, Matlab, Java, Javascript, HTML, CSS
APIs / Frameworks: Django (web), Dash (web), Dakota (optimization)
Hardware Description Languages: Verilog
Computer Aided Design: Autodesk Inventor, NI Multisim, Xilinx Vivado
Assembly Languages: x86, MIPS
Miscellaneous Skills: Git, GDB, Wireshark, LaTeX, Vim, Linux shell, AFL

Leadership & Involvement:
Member, Penn State Pulsar Search Collaboratory, 2017 - 2019
Participant, URISE Research Lab Training Course, 2018
Member, Penn State Student Space Program Laboratory, 2017 - 2018
Captain, Hempfield High School Robotics Team, 2015 - 2017

Honors & Awards:
Spring 2020 Computer Engineering Student Marshal, May 2020
Dean's List, 2017 - Present
Recipient, The Evan Pugh Scholar Senior Award, 2019
Recipient, Lockheed Martin Corporation Scholarship, 2018
Recipient, The President's Freshman Award, 2018
Recipient, National AP Scholar, 2017
Recipient, AP Scholar with Distinction, 2017
Thomas M. Henderson Memorial Scholarship for Computer Science, 2017
Recipient, AP Scholar with Honor, 2016