A Case Study on Sample Complexity, Topology, and Interpolation in Neural Networks

Jonathan Humphries

November 18, 2020

1 Abstract

The general heuristic for determining the sample size to use for training artificial neural networks on real world data sets is “more is better”. Similarly, the heuristic for selecting the number of neurons in the hidden layer of the neural network has also been “more is better”. However, increased sample complexity and topology increase costs in the form of longer training duration and additional computing power. This study uses the completely known and relatively simple problem of numeric addition as its learning task. It attempts to add to the existing body of knowledge on the double descent curve, sample complexity, and topology via a detailed analysis. Though we were unable to identify the exact sample complexity of numeric addition given the available hardware, we were able to identify hyper-parameters for continuing this line of research. We also identified that, given a large enough sample size, the training vs testing error will become correlated and negligible early in training. Finally, we identified an important learning difference between the PyTorch neural network framework and a framework coded from scratch.

2 Introduction

In recent years, several papers have been written regarding the Double Descent curve of the bias-variance trade-off in Artificial Neural Networks (ANN) (Belkin et al., 2019, 2018). This theory states that as the number of neurons increases significantly beyond the number of data points in the training set (sample size) to the point where the training set can be fit exactly by the model (which is called interpolation), the model will achieve better generalization than if the number of neurons is less than the

sample size of the training set. Historically this was considered an overparameterized neural network which would perform poorly against test data (Geman et al., 1992). Belkin et al. increased model complexity beyond Geman et al.'s work using several data sets trained via Stochastic Gradient Descent (Belkin et al., 2019). Subsequent researchers have repeated these findings using multi-layer or Deep Neural Networks (Nakkiran et al., 2019). In addition to studying the topology of DNNs, it is useful to understand the relationship between network topology and sample complexity, the minimum sample size necessary to achieve generalization. For this case study, a data set was created based on numerical addition, which gave us the flexibility to vary sample size while minimizing data set complexities such as intrinsic dimensionality and noise. Unfortunately, the available hardware was not able to train networks to interpolation (the point at which the training error becomes zero) within the available time. We made numerous adjustments and extended our original timeline without success. Despite this setback, we were able to show that a large enough sample size and model complexity will result in highly correlated training and testing errors. We did see a double descent curve early in training; however, continued training further reduced the generalization error, so we cannot support or refute the double descent curve with our data.

3 Theoretical Framework

This section will discuss the general architecture of an Artificial Neural Network and introduce the relevant work that has already been published on these topics. Where multiple terms exist in the literature, we attempt to clarify the various terms and indicate which will be used in the following sections.

3.1 Artificial Neural Network Architecture

Artificial neural networks (ANN) are composed of a network of artificial neurons, though the terms nodes and parameters are also frequently used. The interest in ANN is due to their ability to approximate any continuous function, provided the activation function is not a polynomial (Leshno et al., 1993). Early ANN contained a single layer of hidden neurons. The pattern of connectivity of neurons across the layers of a network is referred to as its topology or model complexity. There are a variety of hyper-parameters used to produce an ANN with a certain set of characteristics. Hyper-parameter optimization is the process of identifying a particular set of hyper-parameters that will perform well on a given problem (Radcliffe, 1993). Given the substantial number of hyper-parameters different ANN may have, finding heuristics for hyper-parameter selection is a major field of study.

Mathematically, a single hidden neuron h_i in an ANN can be described by the following function, where \hat{y} is the output, A(x) is an activation function, w is a vector of weights, x is a vector of input values with w, x ∈ R^d, and b is a scalar (note: this scalar is called the bias; it is not the bias loss function described later). Let

\hat{y} = A((w \cdot x) + b)

Figure 5 depicts a single hidden layer ANN as a directed graph, fully connected between layers of different sizes. A wide variety of Activation Functions (AF) are used in ANN models. The purpose of the AF is to simulate whether, and at what strength, the input values will fire the neuron. The three most common are the sigmoid function, the rectified linear unit, and the hyperbolic tangent. This study used the hyperbolic tangent because it is a continuous function and produces an output in [-1, +1]:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
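To make the single-neuron notation concrete, the following is a minimal sketch of the forward pass of such a network in NumPy. It is an illustration only, not the FS or PyTorch implementation used in this study, and the function names and array shapes are assumptions.

```python
import numpy as np

def neuron_output(w: np.ndarray, x: np.ndarray, b: float) -> float:
    """Single hidden neuron: A((w . x) + b) with A = tanh."""
    return float(np.tanh(np.dot(w, x) + b))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a single-hidden-layer regression network.

    W_hidden: (n_hidden, n_inputs) weight matrix
    b_hidden: (n_hidden,) hidden biases
    w_out:    (n_hidden,) hidden-to-output weights
    b_out:    scalar output bias (no activation on the output layer)
    """
    hidden = np.tanh(W_hidden @ x + b_hidden)
    return float(w_out @ hidden + b_out)
```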

The term Deep Neural Network (DNN) is used to describe a class of neural network topologies with more than one layer of hidden neurons, see Figure 1. In other cases, Deep Neural Networks are ANN which combine one or more hidden layers with a machine learning based pre-processing algorithm(s) (Liu et al., 2017). Hyper-parameter selection is often a practice of trial and error (Smith, 2018). Significant research has been conducted to develop heuristics and tools for hyper-parameter selection. This has been informed by the “no free lunch” (NFL) theorems, which establish that for any given problem some algorithms will outperform others at the expense of under-performing on other classes of problems (Wolpert and Macready, 1997; Altenberg, 2016).

3.2 Training with Stochastic Gradient Descent

One of the most common methods for training an ANN is Stochastic Gradient Descent (SGD) (Rumelhart, 1986). SGD is an iterative method for adjusting the parameters of an ANN by minimizing a loss function (also called optimizing an objective function) (Ruder, 2016).

Figure 1: Deep Neural Network with Two Hidden Layers

Let \hat{Y} be the set of outputs produced by the ANN and Y be the set of desired outputs from the training data set. Then the loss function L(Y, \hat{Y}) is reduced with each iteration through the training data. A single complete pass over all data points in the data set is often called an epoch. The significance of each iterative adjustment to a neuron's weights w is controlled by the hyper-parameter learning rate η. Over-fitting, which is described later, is mitigated by a number of ad-hoc techniques, including the regularization hyper-parameter momentum α. Consider a data point in a training data set (x, y), where x is the input vector and y is the desired output vector. Let i be the index of a particular neuron and t be the index of the training epoch. Then

\Delta w_i(t + 1) = -\eta \frac{\partial L(\hat{y}(x, w(t)), y)}{\partial w_i} + \alpha \Delta w_i(t)

is the change in w_i(t) in the interval (t, t + 1).

This study used two different Neural Network Frameworks, described in section 4.5.

As the loss function is reduced, the change to the weights becomes less significant and eventually training will be unable to continue (Bottou, 2010). Artificial Neural Networks trained via SGD may over-fit the training data, meaning they perform well on the trained examples and poorly on unseen data. To measure this, a common practice is to split the data set into two groups: one for training the network (the training data set) and a second for testing it (the testing data set). The difference between the loss function on the testing data and zero is the generalization error, and it indicates how well the ANN will perform against data not in the training set (Zhang et al., 2016).
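The weight update rule above can be expressed as a short sketch. The function and argument names below are illustrative rather than taken from either framework, and the default learning rate and momentum simply mirror the FS values listed later in Figure 6.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.01):
    """One SGD-with-momentum update: delta_w(t+1) = -lr * dL/dw + momentum * delta_w(t).

    w        -- current weight vector
    grad     -- gradient of the loss with respect to w for the current data point
    velocity -- previous update delta_w(t)
    Returns the updated weights and the new velocity (delta_w(t+1)).
    """
    velocity = -lr * grad + momentum * velocity
    return w + velocity, velocity
```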

3.3 Metrics and Common Loss Functions

Generalization error is one of several important metrics for measuring the ability of the ANN to approximate the unknown function. Often these functions measure an error rate between the desired and actual values. These error rates provide different details on the specific types of errors the ANN might make and thus help us select better hyper-parameters. Intuitively, one might use the difference between the desired and actual outputs as the error for a single data point. Let Y be the set of expected values, X be the set of inputs, e be the error term, and \hat{f}(X) be the ANN prediction. Then

e_i = \hat{f}(X_i) - Y_i

In this case, the average error for an ANN model on a given data set can be calculated as the average of the absolute value of the difference between the expected and predicted values. Minimizing the average error is a useful metric for measuring the progress of the ANN during training, as it keeps the same scale as the expected value and is O(n):

L(Y, \hat{Y}) = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{f}(X_i) - Y_i \right|

Given that our data set is discrete, we define accuracy as the percentage of data points for which the expected and predicted values are equal. Accuracy is often close to zero until training nears interpolation, the point at which training cannot continue because the loss function has been minimized. However, as training nears interpolation, the accuracy tends to increase towards one and becomes more useful.

Let

Acc = \frac{1}{N} \sum_{i=1}^{N} 1\{\hat{f}(x_i) = y_i\}

where 1\{bool\} is a function that returns 1 if bool is true and 0 if bool is false (Sharma et al., 2014). In this study, we found a potentially important feature when comparing the training vs testing error (TTE) values during training. Our results support the possibility that the TTE may be an early indicator of whether a low generalization error is possible for a given sample size and topology. Let

TTE = L(Y, \hat{Y})_{testing} - L(Y, \hat{Y})_{training}
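The three metrics above can be sketched as follows, assuming NumPy arrays of predictions and targets. Rounding the prediction before comparing it to the discrete target in the accuracy function is an illustrative choice, not necessarily how either framework computes it.

```python
import numpy as np

def average_error(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Average error: mean absolute difference between predicted and expected values."""
    return float(np.mean(np.abs(predictions - targets)))

def accuracy(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Fraction of data points where the (rounded) prediction equals the expected value."""
    return float(np.mean(np.round(predictions) == targets))

def train_test_error_gap(train_pred, train_y, test_pred, test_y) -> float:
    """TTE: testing average error minus training average error."""
    return average_error(test_pred, test_y) - average_error(train_pred, train_y)
```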

The machine learning literature commonly reports in terms of the total error and its constituent parts: the bias, the variance, and an irreducible error. These three parts provide a level of granularity that assists in understanding the exact performance of an ANN against unknown data. The total error can be defined as the expected value (probability-weighted average) of the square of the difference between the expected value Y and the value predicted by the model \hat{f}(X). The total error, sometimes called the mean squared error, is then:

Err(X) = E\left[ \left( Y_i - \hat{f}(X_i) \right)^2 \right]

The bias (here a component of the loss, not the bias term in an ANN neuron) measures how close the weighted average of the predicted value of the ANN is to the expected value. It indicates whether or not the ANN is under-fitting the data set and missing relevant features:

Bias^2 = \left( E\left[ \hat{f}(X) \right] - Y \right)^2

The variance is the variability of the prediction for a given data point. High variance can indicate the ANN is over-fitting the data set and may have a poor generalization error:

Variance = E\left[ \left( \hat{f}(x) - E\left[ \hat{f}(x) \right] \right)^2 \right]

Some data sets have noisy or incorrect values. This creates an irreducible error below which the total error cannot be reduced, even for the best model. By splitting the total error into the bias, variance, and irreducible error, we can define an exit point for training. Please note that this numeric addition data set lacks noise, and therefore the irreducible error is zero. The total error is defined as:

Err(x) = Bias^2 + Variance + Irreducible\ Error
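As a concrete illustration of the decomposition (not part of the study's tooling), the sketch below estimates Bias^2 and Variance for a single test point from the predictions of several models, each trained on an independently drawn training set; the function name and interface are assumptions.

```python
import numpy as np

def bias_variance_decomposition(predictions: np.ndarray, y_true: float):
    """Estimate Bias^2 and Variance for one test point.

    predictions -- 1-D array of predictions for the same test point, one per
                   model trained on an independently drawn training set
    y_true      -- the noise-free expected value for that point
    With a noise-free data set the irreducible error is zero, so the
    returned total equals Bias^2 + Variance.
    """
    mean_pred = predictions.mean()
    bias_sq = (mean_pred - y_true) ** 2
    variance = float(np.mean((predictions - mean_pred) ** 2))
    return bias_sq, variance, bias_sq + variance
```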

Figure 2 illustrates areas of high and low bias and high and low variance. Minimizing both these values results in a low total error score and a healthy set of hyper-parameters for an ANN.

Figure 2: Bias and variance in overfitting. Reproduced from (Domingos, 2012)

3.4 Bias-Variance Trade-off and Double Descent

Geman et al. (1992) originally posited the bias-variance dilemma as a limitation where, as the model complexity increases, the bias falls and the variance increases. If the model complexity is too small, the model may under-fit the data (high bias and low variance). If the model complexity is too large, the model may over-fit the data (low bias and high variance). Their analysis depicted a U-shaped curve (Figure 3), called the bias-variance trade-off, which indicates a possible ideal topology that would yield the lowest possible total error for a given problem (Geman et al., 1992).

Figure 3: Bias, Variance, and the U-shaped Total Error curve. Reproduced from (Singh, 2018)

However, new research has subsumed their results by both repeating the U-shaped curve and extending the model by demonstrating a second descent when topology increases significantly (Belkin et al., 2018). Belkin et al.'s extended “double descent” curve (Figure 4) shows total error continuing to be reduced as topology increases. Nakkiran et al. conducted research on multi-layer ANN with similar results (Nakkiran et al., 2019).

3.5 Sample Complexity

In interpolating networks, as an ANN is trained via SGD, the error rate against the training data moves to zero. At that point it is not possible to achieve a better total error. Most real-world problems have a finite amount of data, and researchers are encouraged to use all available data for training. However, one of the goals of this study is to add to the existing literature on sample complexity by studying a domain with an intrinsically low sample complexity and identifying the minimum sample size for a low generalization error. Niyogi and Girosi indicate that in order for the generalization error to go to zero, the number of parameters should grow more slowly than the number of data points, and there should be a fixed optimum number of parameters. They then provide a function and proof as a guide to selecting these values (Niyogi and Girosi, 1995). This is now considered the “classical” regime and has been subsumed by Belkin et al. (2018, 2019), who demonstrate that increasing the number of parameters significantly beyond the interpolation threshold will yield a lower generalization error.

Figure 4: The double descent curve. Topology is p and Total Error is Risk. Reproduced from (Belkin et al., 2019)

4 Methodology

In order to complete a broad study of these topics with a minimum of complexity, a simple data set was chosen. We also hope that this reduces skew towards the unique features of a particular data set. Future work will need to verify these assumptions by duplicating this research using a variety of data sets and training algorithms. While it is important to study both categorization and regression problems, this study is limited to regression.

4.1 Data Set Description

The regression problem selected was the addition of two numbers. The data set was created by randomly selecting two integers in [-10,000, 10,000] and adding them together to produce the expected numerical output (x1 + x2 = y). This function was called at the beginning of each test to generate the required number of data points. Once the data points were generated, 20% of them were held out as the testing data set. We hypothesized that such a simple data set would allow the ANN to achieve interpolation more rapidly, allowing for a broader study using less processing power. The data was normalized via Min-Max scaling to a range optimized for the neural network framework being used (described in more detail in section 4.6). The two normalized values were fed directly into the two-node input layer. The number of hidden layer nodes varied, and some tests were run with multiple hidden layers; in all cases, each node is fully connected to the previous layer's nodes. The single output node is fully connected to the final hidden layer, and its output value is scaled back to the original range (±10,000) and used as the predicted value.
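The data set generation described above can be sketched as follows. The function name, signature, and optional random seed are illustrative assumptions rather than the study's actual code.

```python
import numpy as np

def make_addition_dataset(n_points, low=-10_000, high=10_000, test_fraction=0.2, seed=None):
    """Generate (x1, x2) pairs with target x1 + x2 and hold out a testing split.

    Returns (train_X, train_y, test_X, test_y) as NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(low, high + 1, size=(n_points, 2))
    y = x.sum(axis=1)
    n_test = int(n_points * test_fraction)
    return x[n_test:], y[n_test:], x[:n_test], y[:n_test]
```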

4.2 Metrics and Training Termination

During training, the average error and accuracy scores are calculated with each epoch. When a new best value for one of these two scores is reached, the system records the iteration. Training can be terminated when any one of three training termination hyper-parameters (see Figure 6) is reached. Once training terminates, either because one of these metrics reaches the prescribed value (interpolation) or because a new best metric has not been recorded within the specified number of training iterations, the results for that model are recorded.

4.3 Number of Hidden Layer Neurons

Given the small intrinsic dimensionality of the data set, we suspected that few neurons were needed to achieve a low average error. In order to ensure we found a minimum topology that could achieve interpolation, we also needed to show results with too small a topology. As Belkin et al. (2018) indicated that the number of neurons should exceed the sample complexity, we also selected a topology four times larger than a mid-sized data set.

Figure 5: A single hidden layer artificial neural network

4.4 Training Termination Criteria - No New Best

Stochastic Gradient Descent makes smaller and smaller changes as models move closer to interpolation, which results in diminishing returns as training continues. We established a training termination condition where, if no additional improvement to Average Error or Accuracy was achieved within a certain number of training epochs, training would be terminated. We call this the No New Best (NNB) hyper-parameter. It also provides a loose correlation to training speed: ANN models that regularly achieve new best metrics receive more training epochs and are more likely to reach interpolation. A minimal sketch of this criterion appears below.
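In the sketch below, the `train_step` and `evaluate` callables and the default thresholds are illustrative stand-ins for the FS framework's internals; the defaults simply mirror Figure 6.

```python
def train_with_no_new_best(model, train_step, evaluate, nnb_limit=600,
                           target_error=0.1, target_accuracy=1.0):
    """Run epochs until a termination criterion fires.

    train_step(model) runs one epoch; evaluate(model) returns (avg_error, accuracy).
    Training stops when the average error or accuracy target is met, or when no
    new best value has been recorded for nnb_limit consecutive epochs.
    """
    best_error, best_accuracy = float("inf"), 0.0
    epochs_since_best, epoch = 0, 0
    while True:
        train_step(model)
        epoch += 1
        avg_error, acc = evaluate(model)
        if avg_error < best_error or acc > best_accuracy:
            best_error = min(best_error, avg_error)
            best_accuracy = max(best_accuracy, acc)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if avg_error <= target_error or acc >= target_accuracy:
            return epoch, "interpolation"
        if epochs_since_best >= nnb_limit:
            return epoch, "no new best"
```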

4.5 Two Neural Network Frameworks

For reasons that will become evident in the results section, this study was performed twice on two separate Neural Network Frameworks. The original work was conducted on a private “From Scratch” Neural Network Framework (FS), which was originally developed using the explications of McCaffrey (McCaffrey, 2012). The FS Framework has been in development by this author for more than five years and is used commercially. It contains a number of useful features for long studies, including the ability to queue up a sequence of tests for computing nodes to pull from, pausing and resuming tests, and more. Let the loss function be

L(x_1, x_2, w_1, w_2, w_3, b) := \left| \sum_{i=1}^{n} w_{i3} \tanh(w_{i1} x_1 + w_{i2} x_2 + b_i) + b_{n+1} - (x_1 + x_2) \right|

The specific implementation of SGD in the FS framework for this data set is:

\Delta w_{i1}(t + 1) = \alpha \Delta w_{i1}(t) - \eta \frac{\partial}{\partial w_{i1}} L(x_1, x_2, w_1, w_2, w_3, b)

= \alpha \Delta w_{i1}(t) - \eta \, \mathrm{sign}\left( \sum_{i=1}^{n} w_{i3} \tanh(w_{i1} x_1 + w_{i2} x_2 + b_i) + b_{n+1} - (x_1 + x_2) \right) \times x_1 w_{i3} \left( 1 - \tanh^2(w_{i1} x_1 + w_{i2} x_2 + b_i) \right)
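A NumPy sketch of this update for the w_{i1} weights follows (the other weight vectors and the biases would be updated analogously). It illustrates the equation above rather than reproducing the FS framework's actual code, and the names and default hyper-parameter values are assumptions.

```python
import numpy as np

def fs_style_w1_step(x1, x2, w1, w2, w3, b, vel, lr=0.001, momentum=0.01):
    """One per-sample update of the input weights w1 for the absolute-error loss.

    w1, w2 -- input weights for x1 and x2, shape (n,)
    w3     -- hidden-to-output weights, shape (n,)
    b      -- biases, shape (n + 1,), with b[-1] the output bias b_{n+1}
    vel    -- previous update Delta w1(t), shape (n,)
    Returns the updated w1 and the new velocity Delta w1(t+1).
    """
    pre = w1 * x1 + w2 * x2 + b[:-1]                      # hidden pre-activations
    residual = np.dot(w3, np.tanh(pre)) + b[-1] - (x1 + x2)
    grad_w1 = np.sign(residual) * x1 * w3 * (1.0 - np.tanh(pre) ** 2)
    vel = momentum * vel - lr * grad_w1
    return w1 + vel, vel
```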

The study was duplicated using PyTorch (PY) (Paszke et al., 2019), an open-source Python library compatible with the UH HPC cluster.

4.6 FS Model Hyper-Parameters

This section describes the hyper-parameters of the FS framework. The search space of possible neural network models was limited to those specified in Figure 6. Prior to formal testing, a number of informal tests were conducted to determine useful constant values for the Min-Max scaling range, SGD learning rate, and SGD momentum. Input and output (X, Y) values are normalized via Min-Max normalization, which has been shown to reduce overall training duration, act as a regularization agent, and sometimes eliminate the need for neural network dropout (Ioffe and Szegedy, 2015; Saranya and Manikandan, 2013). Min-Max normalization performs a linear alteration on the original training set: the original values are scaled to within a new range. Consider a value x_c in the set of all possible values for a particular dataset feature C; if the training dataset were represented as a two-dimensional array, then C would refer to all possible values for that column of data.

\frac{x_c - \min_C}{\max_C - \min_C} \times (\mathrm{new\_max}_C - \mathrm{new\_min}_C) + \mathrm{new\_min}_C
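A short sketch of this normalization and of the inverse mapping used to rescale predictions back to the original range follows. The [-13, 13] defaults correspond to the FS scaling range in Figure 6; the function names are assumptions.

```python
def min_max_scale(values, new_min=-13.0, new_max=13.0):
    """Linearly rescale values into [new_min, new_max]; also return the original
    (min, max) so predictions can be mapped back later."""
    old_min, old_max = float(values.min()), float(values.max())
    scaled = (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return scaled, (old_min, old_max)

def min_max_unscale(scaled, old_min, old_max, new_min=-13.0, new_max=13.0):
    """Invert min_max_scale, mapping network outputs back to the original range."""
    return (scaled - new_min) / (new_max - new_min) * (old_max - old_min) + old_min
```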

As mentioned above, the hidden layer activation function was the hyperbolic tangent and no activation function was used on the output layer; these are common hyper-parameter choices for regression problems.

The hyper-parameters for this study were varied within the sets described here:

• Min-Max Scaling: [−13, 13]
• Learning Rate η: 0.001
• Momentum α: 0.01
• Data set size: {10, 100, 1000, 10000}
• Hidden Layer Neurons: {10, 15, 25, 50, 400}
• Layers: {1, 2, 3}
• Terminate training at a No New Best of: {600, 10000, 25000}
• Terminate training at an Average Error better than: 0.1
• Terminate training at an Accuracy of: 1

Figure 6: Hyper-Parameters in the from-scratch (FS) framework

4.7 PyTorch Hyper-Parameters

PyTorch is a publicly available open source Artificial Intelligence Framework designed for usability and performance (Paszke et al., 2019). As we encountered unexpected results from the FS framework, we decided to reattempt our planned experiments using a different framework compatible with the UH HPC cluster. For these experiments, we used a sample size of 200 with 50% reserved for testing. After some initial tests we selected the hyper-parameters in Figure 7. Note that PyTorch treats momentum differently than the FS framework; however, both values are roughly equivalent. PyTorch did not perform better with Min-Max scaling greater than one, thus we normalized values to [-1, 1]. Due to time constraints, we elected to terminate at a maximum number of epochs rather than coding a No New Best termination criterion; however, during testing we monitored the epoch at which a new best was last achieved and selected a max epoch significantly greater than the average best epoch.

4.8 Topology Descriptions

Within the figures and tables in this study we use the convention p × l to indicate p hidden layer neurons spread evenly across l layers. For example, a topology of ten neurons across a single hidden layer is written 10 × 1, and four hundred neurons across three layers is written 400 × 3. In the event the number of neurons cannot be spread evenly across the layers, the excess neurons are added to the last hidden layer (see the sketch below).
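A small helper illustrating the p × l convention; the function name is an assumption and is not taken from either framework.

```python
def layer_sizes(p: int, l: int) -> list:
    """Split p hidden neurons across l layers, adding any remainder to the last layer.

    >>> layer_sizes(10, 3)
    [3, 3, 4]
    >>> layer_sizes(400, 3)
    [133, 133, 134]
    """
    base, extra = divmod(p, l)
    return [base] * (l - 1) + [base + extra]
```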

• Min-Max Scaling: [−1, 1]
• Learning Rate η: 0.0001
• Momentum α: 0.9
• Data set size: {100}
• Hidden Layer Neurons: {10 − 300, in intervals of 5}
• Layers: {1}
• Terminate training at Max Epoch: {100m}
• Terminate training at an Average Error better than: 0
• Terminate training at an Accuracy of: 1

Figure 7: Hyper-Parameters used in the PyTorch framework
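For reference, the following is a minimal sketch of the kind of single-hidden-layer PyTorch model and training loop this configuration describes. The class name, the use of full-batch training, the L1 loss as a stand-in for the average error metric, and the placeholder hidden width and epoch count are all assumptions, not the study's actual code.

```python
import torch
import torch.nn as nn

class AdditionNet(nn.Module):
    """Two inputs, one tanh hidden layer, linear output (no output activation)."""
    def __init__(self, hidden: int = 75):
        super().__init__()
        self.hidden = nn.Linear(2, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.out(torch.tanh(self.hidden(x)))

def train(model, x, y, epochs=100_000, lr=1e-4, momentum=0.9):
    """Full-batch SGD on an L1 (average error) loss; x is (N, 2), y is (N, 1)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()
```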

4.9 Computing Power

For this study I assembled a small datacenter of 6 machines totaling approximately 30 cores, each operating between 3–4 GHz, with a minimum of 8 GB of RAM and solid state drives. While we did run some tests on the UH HPC cluster, we found that the processing power available on the shared nodes was comparable; however, the shared nodes have a time limit of 4–7 days and many of our tests exceeded this duration.

5 Results

The results reported below were compiled from more than a thousand different tests; the full test results and PyTorch source code are available in a public Bitbucket repo: https://bitbucket.org/JonHumphries/thesispy. Results from the two neural network frameworks will be shown separately as From Scratch (FS) and PyTorch (PY) respectively. The FS test results can be further subdivided into two groups: those tests with an NNB hyper-parameter of six hundred (NNB600) and those with ten thousand (NNB10K). The analysis of these two groups led to some interesting findings, most notably that the effect of sample size on the generalization error is visible early in training and that a potentially significant performance optimization may exist in a merge of the FS and PY frameworks.

5.1 Recreating the Double Descent Curve

In order to determine if our findings would mirror the previous work on the double descent curve, we reviewed the results of the NNB600 models and found a fairly clear double descent curve in the models with a single hidden layer and one hundred data points (Figure 8). However, we were uncertain if this could be considered a “true” double descent curve because the training error did not achieve interpolation.

Figure 8: Error on single hidden layer with a sample size of 100 and termination at No New Best 600.

We then ran significantly more models to develop a larger picture of the double descent curve and determine at which point interpolation could be achieved. Even a model complexity above 500 neurons was not able to achieve interpolation given the hyper-parameters selected. This was the reason to switch to the UH HPC cluster and the PyTorch framework; perhaps using a different Neural Network Framework or additional computing power would yield different results. We coded the same study in PyTorch and ran a series of exploratory tests to identify useful hyper-parameters. We also explored the Rectified Linear Unit (ReLU) activation function (Agarap, 2018), though we chose to use the hyperbolic tangent because those tests had a lower error rate. For the PyTorch tests, we ran each hyper-parameter set seven times, producing the results in Figure 9. The results are similar to the FS framework: the training and testing error are highly correlated and the training error does not reach interpolation. It is worth noting that neither Google Colab nor the UH HPC cluster's shared nodes yielded significant performance advantages over what was available in our private cluster. Furthermore, the UH HPC shared nodes' time limitations would require the advanced pause/resume features to complete the planned tests.

Figure 9: Error on single hidden layer with a sample size of 100 and termination at 100m epochs. We had better performance with PyTorch when the data was normalized between ±1, and we only tracked the re-scaled Average Error for the testing data. This chart depicts training and testing together and thus uses the normalized values.

For both the FS and PY frameworks the critical point for the second descent occurs after a model complexity of forty. If we had reached interpolation, we could have claimed successful reproduction of the double descent curve of Belkin et al. (2018). The fact that major features of the curves were nearly identical across both frameworks indicates that the difficulty in reaching interpolation was not due to the framework but to some property of the dataset or the hyper-parameters selected.

5.2 The Very Long Study, Elusive Interpolation

In all of our tests, we never achieved interpolation (a training error of zero). We set up a PyTorch test with no maximum iteration, a learning rate of 1e-6, and a model complexity of 75 (notice the interesting dip in Figure 9 around 75-80) and allowed it to run for weeks. At 1.7 billion epochs, the normalized (±1) training error was 8.6e-7 and the testing error was 3.7e-6. In the original scale (±10k), the testing error rate was 0.067, which yields a greater than 90% accuracy. This significantly exceeds the training and testing error rates of all tests at the 1e-5 learning rate. The closest comparable tests were achieved on the FS framework with a sample size of 10k and a model complexity of 25 after 347k epochs. It is worth noting that while the FS framework can achieve comparable results in fewer epochs than PyTorch, chronologically they are similar.

5.3 Low Training vs Testing Error in Early Training

Some of our tests resulted in an insignificant Training vs Testing Error (TTE), and we sought to identify the hyper-parameter(s) that supported this. By logging the TTE after each training epoch, it became evident that if the hyper-parameter values for model complexity and sample size were not above some minimum value, then the generalization error would be higher. Furthermore, when model complexity and sample size were sufficient for the problem, the TTE was minimized long before training terminated. Sufficient sample size and model complexity also resulted in correlated training and testing errors.

Figure 10: High TTE - training epochs with ten data points and ten nodes across three layers. This is typical of an unhealthy network: interpolation will not be achieved and the TTE is high.

Figure 11: Lower TTE - training epochs with one hundred data points and twenty-five nodes across two layers. The training and testing error rates are more closely related.

Figure 12: Extremely Low TTE - training epochs with ten thousand data points and twenty-five nodes across three layers. Testing follows training almost exactly and the chart software is unable to display both (Testing is drawn on top of Training).

Figure 13: Error on ten thousand data points and twenty-five nodes across three layers. The same test as Figure 12.

Figures 10 through 12 indicate that, as the sample size increased, the TTE decreased. Figure 13 shows the TTE for the same test as Figure 12. It indicates that the variance in the TTE stabilized after a few thousand training epochs and was further reduced as training proceeded towards interpolation. In tests where training was conducted for hundreds of thousands of epochs, the TTE became ±0.01 and negligible. In these cases the sample size was greater than the sample complexity for numeric addition.

5.4 Incidental Findings with Potential Performance Impacts

After becoming familiar with the PyTorch framework, we noticed a difference between it and the FS framework. Both frameworks took similar chronological times to achieve similar results; however, the FS framework would do it in significantly fewer iterations. This is flagged as an area for future research, but as of the publication of this thesis we have not discovered the root cause(s) of this difference. We did extend the delivery date of this thesis as far as possible while performing as much testing as possible in the additional time. These tests only served to rule out certain potential causes.

6 Proposed future research

All areas of our original proposal remain open until interpolation can be achieved. If interpolation is not possible, what about numeric addition or the framework prevents this? To continue this study, we propose additional tests with smaller learning rates, no max epoch, and model complexity larger than 200. If interpolation is not achieved, the model complexity should be increased and/or the learning rate reduced. Once interpolation is achieved, the model complexity can be decreased at regular intervals until interpolation can no longer be achieved in a similar number of epochs. We would also like to continue tests with sample sizes at regular intervals below 100 (though the testing set should remain at 100). At what point does the correlation between training and testing error break? This would indicate the sample complexity for addition. Once this is identified, what impact does adjusting the model complexity have, and are model complexity and sample complexity correlated? Possibly most interesting, as a result of running the tests in both the FS and PY frameworks we identified an important difference with potentially massive performance implications. We intend to pursue this topic by forking PyTorch and systematically replacing sections of the learning algorithm that may be found to differ from the FS framework with code that matches the FS framework. Does making this change to PyTorch result in both faster learning and faster chronological training? Assuming that these changes to PyTorch result in a chronological improvement, what exact performance improvement is gained on a standard dataset such as MNIST?

6.1 Conclusion

While our original goal of identifying the sample complexity of numeric addition was not achievable given our hardware and time constraints, we did find interesting results in the early identification of insufficient data to model a problem and a potential learning improvement in training via Stochastic Gradient Descent. We also provide a series of future research projects on both topics.

6.2 Acknowledgements

I'd like to take a moment to recognize Prof. Lee Altenberg for taking me on despite his tenuous contract and us working on different islands. I'd also like to thank Prof. Kimberly Binsted for being my first advisor, as well as the other members of my committee; Prof. Ramon Figueroa-Centeno for countless hours helping me with mathematics; and Prof. H. Keith Edwards for encouraging me to go to grad school and helping me through the pre-requisites. Finally, my parents, friends, and family, whose support kept me healthy and well during one of the most difficult times of my life. Mahalo.

References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375.

Altenberg, L. (2016). Evolutionary computation. Encyclopedia of Evolutionary Biology, 2:40–47.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.

Belkin, M., Hsu, D., and Xu, J. (2019). Two models of double descent for weak features. arXiv preprint arXiv:1903.07571.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G., editors, Proceedings of COMPSTAT'2010, pages 177–186, Heidelberg. Physica-Verlag HD.

Domingos, P. (2012). A few useful things to know about machine learning. Commun. ACM, 55(10):78–87.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing, 234:11–26.

McCaffrey, J. (2012). Test Run: Neural network back-propagation for programmers. MSDN Magazine, 27(10).

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.

Niyogi, P. and Girosi, F. (1995). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Radcliffe, N. J. (1993). Genetic set recombination and its application to neural network topology optimisation. Neural Computing and Applications, 1(1):67–90.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Rumelhart, D. E. (1986). Learning internal representations by error propagation. Parallel Distributed Processing, 1:318–362.

Saranya, C. and Manikandan, G. (2013). A study on normalization techniques for privacy preserving data mining. International Journal of Engineering and Technology (IJET), 5(3):2701–2704.

Sharma, R., Nori, A. V., and Aiken, A. (2014). Bias-variance tradeoffs in program analysis. SIGPLAN Not., 49(1):127–137.

Singh, S. (2018). Understanding the bias variance tradeoff. Accessed: 2020-02-08.

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. CoRR, abs/1803.09820.

Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1:67–82.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530.
