A Case Study on Sample Complexity, Topology, and Interpolation in Neural Networks

Jonathan Humphries

November 18, 2020

1 Abstract

The general heuristic for determining the sample size to use for training artificial neural networks on real world data sets is “more is better”. Similarly, the heuristic for selecting the number of neurons in the hidden layer of the neural network has also been “more is better”. However, increased sample complexity and topology increase costs in the form of longer training duration and additional computing power. This study uses the completely known and relatively simple problem of numeric addition as its learning task. It attempts to add to the existing body of knowledge on the double descent curve, sample complexity, and topology via a detailed analysis. Though we were unable to identify the exact sample complexity of numeric addition given the available hardware, we were able to identify hyper-parameters for continuing this line of research. We also identified that, given a large enough sample size, the training vs testing error will become correlated and negligible early in training. Finally, we identified an important learning difference between the PyTorch neural network framework and a framework coded from scratch.

2 Introduction

In recent years, several papers have been written regarding the Double Descent curve of the bias-variance trade-off in Artificial Neural Networks (ANN) (Belkin et al., 2019, 2018). This theory states that as the number of neurons increases significantly beyond the number of data points in the training set (sample size) to the point where the training set can be fit exactly by the model (which is called interpolation), the model will achieve better generalization than if the number of neurons is less than the

sample size of the training set. Historically this was considered an overparameterized neural network which would perform poorly against test data (Geman et al., 1992). Belkin et al. increased model complexity beyond Geman et al.'s work using several data sets trained via Stochastic Gradient Descent (Belkin et al., 2019). Subsequent researchers have repeated these findings using multi-layer or Deep Neural Networks (Nakkiran et al., 2019). In addition to studying the topology of DNNs, it is useful to understand the relationship between network topology and sample complexity, the minimum sample size necessary to achieve generalization. For this case study, a data set was created based on numerical addition, which gave us the flexibility to vary sample size while minimizing data set complexities such as intrinsic dimensionality and noise. Unfortunately, the available hardware was not able to train networks to interpolation (the point at which the training error becomes zero) within the available time. We made numerous adjustments and extended our original timeline without success. Despite this setback, we were able to show that a large enough sample size and model complexity will result in highly correlated training and testing errors. We did see a double descent curve early in training; however, continued training further reduced the generalization error, so we cannot support or refute the double descent curve with our data.

3 Theoretical Framework

This section will discuss the general architecture of an Artificial Neural Network and introduce the relevant work that has already been published on these topics. Where multiple terms exist in the literature, we attempt to clarify the various terms and indicate which will be used in the following sections.

3.1 Artificial Neural Network Architecture

Artificial neural networks (ANN) are composed of a network of artificial neurons, though the terms nodes and parameters are also frequently used. The interest in ANN is due to their ability to approximate any continuous function, provided the activation function is not a polynomial (Leshno et al., 1993). Early ANN contained a single layer of hidden neurons. The pattern of connectivity of neurons across the layers of a network is referred to as its topology or model complexity. There are a variety of hyper-parameters used to produce an ANN with a certain set of characteristics. Hyper-parameter optimization is the process of identifying a particular set of hyper-parameters that will perform well on a given problem (Radcliffe, 1993). Given the substantial number of hyper-parameters different ANN may have, finding heuristics for hyper-parameter selection is a major field of study.

Mathematically, a single hidden neuron h_i in an ANN can be described by the following function, where \hat{y} is the output, A(x) is an activation function, w is a vector of weights, x is a vector of input values with w, x ∈ R^d, and b is a scalar (note: this scalar is called the bias; it is not the bias loss function described later). Let

\hat{y} = A((w \cdot x) + b)

Figure 5 depicts a single hidden layer ANN as a directed graph, fully connected between layers of different sizes. A wide variety of Activation Functions (AF) are used in ANN models. The purpose of the AF is to simulate whether, and at what strength, the input values will fire the neuron. The three most common are the sigmoid function, the rectified linear unit, and the hyperbolic tangent. This study used the hyperbolic tangent because it is a continuous function and produces an output in [-1, +1]:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
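To make the single-neuron notation concrete, the following is a minimal sketch of the forward pass of such a network in NumPy. It is an illustration only, not the FS or PyTorch implementation used in this study, and the function names and array shapes are assumptions.

```python
import numpy as np

def neuron_output(w: np.ndarray, x: np.ndarray, b: float) -> float:
    """Single hidden neuron: A((w . x) + b) with A = tanh."""
    return float(np.tanh(np.dot(w, x) + b))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a single-hidden-layer regression network.

    W_hidden: (n_hidden, n_inputs) weight matrix
    b_hidden: (n_hidden,) hidden biases
    w_out:    (n_hidden,) hidden-to-output weights
    b_out:    scalar output bias (no activation on the output layer)
    """
    hidden = np.tanh(W_hidden @ x + b_hidden)
    return float(w_out @ hidden + b_out)
```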

The term Deep Neural Network (DNN) is used to describe a class of neural network topologies with more than one layer of hidden neurons, see Figure 1. In other cases, Deep Neural Networks are ANN which combine one or more hidden layers with a machine learning based pre-processing algorithm(s) (Liu et al., 2017). Hyper-parameter selection is often a practice of trial and error (Smith, 2018). Significant research has been conducted to develop heuristics and tools for hyper-parameter selection. This has been informed by the “no free lunch” (NFL) theorems, which establish that for any given problem some algorithms will outperform others at the expense of under-performing on other classes of problems (Wolpert and Macready, 1997; Altenberg, 2016).

3.2 Training with Stochastic Gradient Descent

One of the most common methods for training an ANN is Stochastic Gradient Descent (SGD) (Rumelhart, 1986). SGD is an iterative method for adjusting the parameters of an ANN by minimizing a loss function (also called optimizing an objective function) (Ruder, 2016).

Figure 1: Deep Neural Network with Two Hidden Layers

Let \hat{Y} be the set of outputs produced by the ANN and Y be the set of desired outputs from the training data set. Then the loss function L(Y, \hat{Y}) is reduced with each iteration through the training data. A single complete pass over all data points in the data set is often called an epoch. The significance of each iterative adjustment to a neuron's weights w is controlled by the hyper-parameter learning rate η. Over-fitting, which is described later, is mitigated by a number of ad-hoc techniques, including the regularization hyper-parameter momentum α. Consider a data point in a training data set (x, y), where x is the input vector and y is the desired output vector. Let i be the index of a particular neuron and t be the index of the training epoch. Then

\Delta w_i(t + 1) = -\eta \frac{\partial L(\hat{y}(x, w(t)), y)}{\partial w_i} + \alpha \Delta w_i(t)

is the change in w_i(t) in the interval (t, t + 1).

This study used two different Neural Network Frameworks, described in section 4.5.

As the loss function is reduced, the change to the weights becomes less significant and eventually training will be unable to continue (Bottou, 2010). Artificial Neural Networks trained via SGD may over-fit the training data, meaning they perform well on the trained examples and poorly on unseen data. To measure this, a common practice is to split the data set into two groups: one for training the network (the training data set) and a second for testing it (the testing data set). The difference between the loss function on the testing data and zero is the generalization error, and it indicates how well the ANN will perform against data not in the training set (Zhang et al., 2016).
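The weight update rule above can be expressed as a short sketch. The function and argument names below are illustrative rather than taken from either framework, and the default learning rate and momentum simply mirror the FS values listed later in Figure 6.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.01):
    """One SGD-with-momentum update: delta_w(t+1) = -lr * dL/dw + momentum * delta_w(t).

    w        -- current weight vector
    grad     -- gradient of the loss with respect to w for the current data point
    velocity -- previous update delta_w(t)
    Returns the updated weights and the new velocity (delta_w(t+1)).
    """
    velocity = -lr * grad + momentum * velocity
    return w + velocity, velocity
```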

3.3 Metrics and Common Loss Functions

Generalization error is one of several important metrics for measuring the ability of the ANN to approximate the unknown function. Often these functions measure an error rate between the desired and actual values. These error rates provide different details on the specific types of errors the ANN might make and thus help us select better hyper-parameters. Intuitively, one might use the difference between the desired and actual outputs as the error for a single data point. Let Y be the set of expected values, X be the set of inputs, e be the error term, and \hat{f}(X) be the ANN prediction. Then

e_i = \hat{f}(X_i) - Y_i

In this case, the average error for an ANN model on a given data set can be calculated as the average of the absolute value of the difference between the expected and predicted values. Minimizing the average error is a useful metric for measuring the progress of the ANN during training, as it keeps the same scale as the expected value and is O(n):

L(Y, \hat{Y}) = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{f}(X_i) - Y_i \right|

Given that our data set is discrete, we define accuracy as the percentage of data points for which the expected and predicted values are equal. Accuracy is often close to zero until training nears interpolation, the point at which training cannot continue because the loss function has been minimized. However, as training nears interpolation, the accuracy tends to increase towards one and becomes more useful.

Let

Acc = \frac{1}{N} \sum_{i=1}^{N} 1\{\hat{f}(x_i) = y_i\}

where 1\{bool\} is a function that returns 1 if bool is true and 0 if bool is false (Sharma et al., 2014). In this study, we found a potentially important feature when comparing the training vs testing error (TTE) values during training. Our results support the possibility that the TTE may be an early indicator of whether a low generalization error is possible for a given sample size and topology. Let

TTE = L(Y, \hat{Y})_{testing} - L(Y, \hat{Y})_{training}
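The three metrics above can be sketched as follows, assuming NumPy arrays of predictions and targets. Rounding the prediction before comparing it to the discrete target in the accuracy function is an illustrative choice, not necessarily how either framework computes it.

```python
import numpy as np

def average_error(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Average error: mean absolute difference between predicted and expected values."""
    return float(np.mean(np.abs(predictions - targets)))

def accuracy(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Fraction of data points where the (rounded) prediction equals the expected value."""
    return float(np.mean(np.round(predictions) == targets))

def train_test_error_gap(train_pred, train_y, test_pred, test_y) -> float:
    """TTE: testing average error minus training average error."""
    return average_error(test_pred, test_y) - average_error(train_pred, train_y)
```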

The machine learning literature commonly reports in terms of the total error and its constituent parts: the bias, the variance, and an irreducible error. These three parts provide a level of granularity that assists in understanding the exact performance of an ANN against unknown data. The total error can be defined as the expected value (probability-weighted average) of the square of the difference between the expected value Y and the value predicted by the model \hat{f}(X). The total error, sometimes called the mean squared error, is then:

Err(X) = E\left[ \left( Y_i - \hat{f}(X_i) \right)^2 \right]

The bias (here a component of the loss, not the bias term in an ANN neuron) measures how close the weighted average of the predicted value of the ANN is to the expected value. It indicates whether or not the ANN is under-fitting the data set and missing relevant features:

Bias^2 = \left( E\left[ \hat{f}(X) \right] - Y \right)^2

The variance is the variability of the prediction for a given data point. High variance can indicate the ANN is over-fitting the data set and may have a poor generalization error:

Variance = E\left[ \left( \hat{f}(x) - E\left[ \hat{f}(x) \right] \right)^2 \right]

Some data sets have noisy or incorrect values. This creates an irreducible error below which the total error cannot be reduced, even for the best model. By splitting the total error into the bias, variance, and irreducible error, we can define an exit point for training. Please note that this numeric addition data set lacks noise, and therefore the irreducible error is zero. The total error is defined as:

Err(x) = Bias^2 + Variance + Irreducible\ Error
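As a concrete illustration of the decomposition (not part of the study's tooling), the sketch below estimates Bias^2 and Variance for a single test point from the predictions of several models, each trained on an independently drawn training set; the function name and interface are assumptions.

```python
import numpy as np

def bias_variance_decomposition(predictions: np.ndarray, y_true: float):
    """Estimate Bias^2 and Variance for one test point.

    predictions -- 1-D array of predictions for the same test point, one per
                   model trained on an independently drawn training set
    y_true      -- the noise-free expected value for that point
    With a noise-free data set the irreducible error is zero, so the
    returned total equals Bias^2 + Variance.
    """
    mean_pred = predictions.mean()
    bias_sq = (mean_pred - y_true) ** 2
    variance = float(np.mean((predictions - mean_pred) ** 2))
    return bias_sq, variance, bias_sq + variance
```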

Figure 2 illustrates areas of high and low bias and high and low variance. Minimizing both these values results in a low total error score and a healthy set of hyper-parameters for an ANN.

Figure 2: Bias and variance in overfitting. Reproduced from (Domingos, 2012)

3.4 Bias-Variance Trade-off and Double Descent

Geman et al. (1992) originally posited the bias-variance dilemma as a limitation where, as the model complexity increases, the bias falls and the variance increases. If the model complexity is too small, the model may under-fit the data (high bias and low variance). If the model complexity is too large, the model may over-fit the data (low bias and high variance). Their analysis depicted a U-shaped curve (Figure 3), called the bias-variance trade-off, which indicates a possible ideal topology that would yield the lowest possible total error for a given problem (Geman et al., 1992).

Figure 3: Bias, Variance, and the U-shaped Total Error curve. Reproduced from (Singh, 2018)

However, new research has subsumed their results by both repeating the U-shaped curve and extending the model by demonstrating a second descent when topology increases significantly (Belkin et al., 2018). Belkin et al.'s extended “double descent” curve (Figure 4) shows total error continuing to be reduced as topology increases. Nakkiran et al. conducted research on multi-layer ANN with similar results (Nakkiran et al., 2019).

3.5 Sample Complexity

In interpolating networks, as an ANN is trained via SGD, the error rate against the training data moves to zero. At that point it is not possible to achieve a better total error. Most real-world problems have a finite amount of data, and researchers are encouraged to use all available data for training. However, one of the goals of this study is to add to the existing literature on sample complexity by studying a domain with an intrinsically low sample complexity and identifying the minimum sample size for a low generalization error. Niyogi and Girosi indicate that in order for the generalization error to go to zero, the number of parameters should grow more slowly than the number of data points, and there should be a fixed optimum number of parameters. They then provide a function and proof as a guide to selecting these values (Niyogi and Girosi, 1995). This is now considered the “classical” regime and has been subsumed by Belkin et al. (2018, 2019), who demonstrate that increasing the number of parameters significantly beyond the interpolation threshold will yield a lower generalization error.

Figure 4: The double descent curve. Topology is p and Total Error is Risk. Reproduced from (Belkin et al., 2019)

4 Methodology

In order to complete a broad study of these topics with a minimum of complexity, a simple data set was chosen. We also hope that this reduces skew towards the unique features of a particular data set. Future work will need to verify these assumptions by duplicating this research using a variety of data sets and training algorithms. While it is important to study both categorization and regression problems, this study is limited to regression.

4.1 Data Set Description

The regression problem selected was the addition of two numbers. The data set was created by randomly selecting two integers in [-10,000, 10,000] and adding them together to produce the expected numerical output (x1 + x2 = y). This function was called at the beginning of each test to generate the required number of data points. Once the data points were generated, 20% of them were held out as the testing data set. We hypothesized that such a simple data set would allow the ANN to achieve interpolation more rapidly, allowing for a broader study using less processing power. The data was normalized via Min-Max scaling to a range optimized for the neural network framework being used (described in more detail in section 4.6). The two normalized values were fed directly into the two-node input layer. The number of hidden layer nodes varied, and some tests were run with multiple hidden layers; in all cases, each node is fully connected to the previous layer's nodes. The single output node is fully connected to the final hidden layer, and its output value is scaled back to the original range (±10,000) and used as the predicted value.
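The data set generation described above can be sketched as follows. The function name, signature, and optional random seed are illustrative assumptions rather than the study's actual code.

```python
import numpy as np

def make_addition_dataset(n_points, low=-10_000, high=10_000, test_fraction=0.2, seed=None):
    """Generate (x1, x2) pairs with target x1 + x2 and hold out a testing split.

    Returns (train_X, train_y, test_X, test_y) as NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(low, high + 1, size=(n_points, 2))
    y = x.sum(axis=1)
    n_test = int(n_points * test_fraction)
    return x[n_test:], y[n_test:], x[:n_test], y[:n_test]
```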

4.2 Metrics and Training Termination

During training, the average error and accuracy scores are calculated with each epoch. When a new best value for one of these two scores is reached, the system records the iteration. Training can be terminated when any one of three training termination hyper-parameters (see Figure 6) is reached. Once training terminates, either because one of these metrics reaches the prescribed value (interpolation) or because a new best metric has not been recorded within the specified number of training iterations, the results for that model are recorded.

4.3 Number of Hidden Layer Neurons

Given the small intrinsic dimensionality of the data set, we suspected that few neurons were needed to achieve a low average error. In order to ensure we found a minimum topology that could achieve interpolation, we also needed to show results with too small a topology. As Belkin et al. (2018) indicated that the number of neurons should exceed the sample complexity, we also selected a topology four times larger than a mid-sized data set.

Figure 5: A single hidden layer artificial neural network

4.4 Training Termination Criteria - No New Best

Stochastic Gradient Descent makes smaller and smaller changes as models move closer to interpolation, which results in diminishing returns as training continues. We established a training termination condition where, if no additional improvement to Average Error or Accuracy was achieved within a certain number of training epochs, training would be terminated. We call this the No New Best (NNB) hyper-parameter. It also provides a loose correlation to training speed: ANN models that regularly achieve new best metrics receive more training epochs and are more likely to reach interpolation. A minimal sketch of this criterion appears below.
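In the sketch below, the `train_step` and `evaluate` callables and the default thresholds are illustrative stand-ins for the FS framework's internals; the defaults simply mirror Figure 6.

```python
def train_with_no_new_best(model, train_step, evaluate, nnb_limit=600,
                           target_error=0.1, target_accuracy=1.0):
    """Run epochs until a termination criterion fires.

    train_step(model) runs one epoch; evaluate(model) returns (avg_error, accuracy).
    Training stops when the average error or accuracy target is met, or when no
    new best value has been recorded for nnb_limit consecutive epochs.
    """
    best_error, best_accuracy = float("inf"), 0.0
    epochs_since_best, epoch = 0, 0
    while True:
        train_step(model)
        epoch += 1
        avg_error, acc = evaluate(model)
        if avg_error < best_error or acc > best_accuracy:
            best_error = min(best_error, avg_error)
            best_accuracy = max(best_accuracy, acc)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if avg_error <= target_error or acc >= target_accuracy:
            return epoch, "interpolation"
        if epochs_since_best >= nnb_limit:
            return epoch, "no new best"
```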

4.5 Two Neural Network Frameworks

For reasons that will become evident in the results section, this study was performed twice on two separate Neural Network Frameworks. The original work was conducted on a private “From Scratch” Neural Network Framework (FS), which was originally developed using the explications of McCaffrey (McCaffrey, 2012). The FS Framework has been in development by this author for more than five years and is used commercially. It contains a number of useful features for long studies, including the ability to queue up a sequence of tests for computing nodes to pull from, pausing and resuming tests, and more. Let the loss function be

L(x_1, x_2, w_1, w_2, w_3, b) := \left| \sum_{i=1}^{n} w_{i3} \tanh(w_{i1} x_1 + w_{i2} x_2 + b_i) + b_{n+1} - (x_1 + x_2) \right|

The specific implementation of SGD in the FS framework for this data set is:

\Delta w_{i1}(t + 1) = \alpha \Delta w_{i1}(t) - \eta \frac{\partial}{\partial w_{i1}} L(x_1, x_2, w_1, w_2, w_3, b)

= \alpha \Delta w_{i1}(t) - \eta \, \mathrm{sign}\left( \sum_{i=1}^{n} w_{i3} \tanh(w_{i1} x_1 + w_{i2} x_2 + b_i) + b_{n+1} - (x_1 + x_2) \right) \times x_1 w_{i3} \left( 1 - \tanh^2(w_{i1} x_1 + w_{i2} x_2 + b_i) \right)
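A NumPy sketch of this update for the w_{i1} weights follows (the other weight vectors and the biases would be updated analogously). It illustrates the equation above rather than reproducing the FS framework's actual code, and the names and default hyper-parameter values are assumptions.

```python
import numpy as np

def fs_style_w1_step(x1, x2, w1, w2, w3, b, vel, lr=0.001, momentum=0.01):
    """One per-sample update of the input weights w1 for the absolute-error loss.

    w1, w2 -- input weights for x1 and x2, shape (n,)
    w3     -- hidden-to-output weights, shape (n,)
    b      -- biases, shape (n + 1,), with b[-1] the output bias b_{n+1}
    vel    -- previous update Delta w1(t), shape (n,)
    Returns the updated w1 and the new velocity Delta w1(t+1).
    """
    pre = w1 * x1 + w2 * x2 + b[:-1]                      # hidden pre-activations
    residual = np.dot(w3, np.tanh(pre)) + b[-1] - (x1 + x2)
    grad_w1 = np.sign(residual) * x1 * w3 * (1.0 - np.tanh(pre) ** 2)
    vel = momentum * vel - lr * grad_w1
    return w1 + vel, vel
```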

The study was duplicated using PyTorch (PY) (Paszke et al., 2019), an open-source Python library compatible with the UH HPC cluster.

4.6 FS Model Hyper-Parameters

This section describes the hyper-parameters of the FS framework. The search space of possible neural network models was limited to those specified in Figure 6. Prior to formal testing, a number of informal tests were conducted to determine useful constant values for the Min-Max scaling range, SGD learning rate, and SGD momentum. Input and output (X, Y) values are normalized via Min-Max normalization, which has been shown to reduce overall training duration, act as a regularization agent, and sometimes eliminate the need for neural network dropout (Ioffe and Szegedy, 2015; Saranya and Manikandan, 2013). Min-Max normalization performs a linear alteration on the original training set: the original values are scaled to within a new range. Consider a value x_c in the set of all possible values for a particular dataset feature C; if the training dataset were represented as a two-dimensional array, then C would refer to all possible values for that column of data.

\frac{x_c - \min_C}{\max_C - \min_C} \times (\mathrm{new\_max}_C - \mathrm{new\_min}_C) + \mathrm{new\_min}_C
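A short sketch of this normalization and of the inverse mapping used to rescale predictions back to the original range follows. The [-13, 13] defaults correspond to the FS scaling range in Figure 6; the function names are assumptions.

```python
def min_max_scale(values, new_min=-13.0, new_max=13.0):
    """Linearly rescale values into [new_min, new_max]; also return the original
    (min, max) so predictions can be mapped back later."""
    old_min, old_max = float(values.min()), float(values.max())
    scaled = (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return scaled, (old_min, old_max)

def min_max_unscale(scaled, old_min, old_max, new_min=-13.0, new_max=13.0):
    """Invert min_max_scale, mapping network outputs back to the original range."""
    return (scaled - new_min) / (new_max - new_min) * (old_max - old_min) + old_min
```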

As mentioned above, the hidden layer activation function was the hyperbolic tangent and no activation function was used on the output layer; these are common hyper-parameter choices for regression problems.

The hyper-parameters for this study were varied within the sets described here:

• Min-Max Scaling: [−13, 13]
• Learning Rate η: 0.001
• Momentum α: 0.01
• Data set size: {10, 100, 1000, 10000}
• Hidden Layer Neurons: {10, 15, 25, 50, 400}
• Layers: {1, 2, 3}
• Terminate training at a No New Best of: {600, 10000, 25000}
• Terminate training at an Average Error better than: 0.1
• Terminate training at an Accuracy of: 1

Figure 6: Hyper-Parameters in the from-scratch (FS) framework

4.7 PyTorch Hyper-Parameters

PyTorch is a publicly available open source Artificial Intelligence Framework designed for usability and performance (Paszke et al., 2019). As we encountered unexpected results from the FS framework, we decided to reattempt our planned experiments using a different framework compatible with the UH HPC cluster. For these experiments, we used a sample size of 200 with 50% reserved for testing. After some initial tests we selected the hyper-parameters in Figure 7. Note that PyTorch treats momentum differently than the FS framework; however, both values are roughly equivalent. PyTorch did not perform better with Min-Max scaling greater than one, thus we normalized values to [-1, 1]. Due to time constraints, we elected to terminate at a maximum number of epochs rather than coding a No New Best termination criterion; however, during testing we monitored the epoch at which a new best was last achieved and selected a max epoch significantly greater than the average best epoch.

4.8 Topology Descriptions

Within the figures and tables in this study we use the convention p × l to indicate p hidden layer neurons spread evenly across l layers. For example, a topology of ten neurons across a single hidden layer is written 10 × 1, and four hundred neurons across three layers is written 400 × 3. In the event the number of neurons cannot be spread evenly across the layers, the excess neurons are added to the last hidden layer (see the sketch below).
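A small helper illustrating the p × l convention; the function name is an assumption and is not taken from either framework.

```python
def layer_sizes(p: int, l: int) -> list:
    """Split p hidden neurons across l layers, adding any remainder to the last layer.

    >>> layer_sizes(10, 3)
    [3, 3, 4]
    >>> layer_sizes(400, 3)
    [133, 133, 134]
    """
    base, extra = divmod(p, l)
    return [base] * (l - 1) + [base + extra]
```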

• Min-Max Scaling: [−1, 1]
• Learning Rate η: 0.0001
• Momentum α: 0.9
• Data set size: {100}
• Hidden Layer Neurons: {10 − 300, in intervals of 5}
• Layers: {1}
• Terminate training at Max Epoch: {100m}
• Terminate training at an Average Error better than: 0
• Terminate training at an Accuracy of: 1

Figure 7: Hyper-Parameters used in the PyTorch framework
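For reference, the following is a minimal sketch of the kind of single-hidden-layer PyTorch model and training loop this configuration describes. The class name, the use of full-batch training, the L1 loss as a stand-in for the average error metric, and the placeholder hidden width and epoch count are all assumptions, not the study's actual code.

```python
import torch
import torch.nn as nn

class AdditionNet(nn.Module):
    """Two inputs, one tanh hidden layer, linear output (no output activation)."""
    def __init__(self, hidden: int = 75):
        super().__init__()
        self.hidden = nn.Linear(2, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.out(torch.tanh(self.hidden(x)))

def train(model, x, y, epochs=100_000, lr=1e-4, momentum=0.9):
    """Full-batch SGD on an L1 (average error) loss; x is (N, 2), y is (N, 1)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()
```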

4.9 Computing Power

For this study I assembled a small datacenter of 6 machines totaling approximately 30 cores, each operating between 3–4 GHz, with a minimum of 8 GB of RAM and solid state drives. While we did run some tests on the UH HPC cluster, we found that the processing power available on the shared nodes was comparable; however, the shared nodes have a time limit of 4–7 days and many of our tests exceeded this duration.

5 Results

The results reported below were compiled from more than a thousand different tests; the full test results and PyTorch source code are available in a public Bitbucket repo: https://bitbucket.org/JonHumphries/thesispy. Results from the two neural network frameworks will be shown separately as From Scratch (FS) and PyTorch (PY) respectively. The FS test results can be further subdivided into two groups: those tests with an NNB hyper-parameter of six hundred (NNB600) and those with ten thousand (NNB10K). The analysis of these two groups led to some interesting findings, most notably that the effect of sample size on the generalization error is visible early in training and that a potentially significant performance optimization may exist in a merge of the FS and PY frameworks.

5.1 Recreating the Double Descent Curve

In order to determine if our findings would mirror the previous work on the double descent curve, we reviewed the results of the NNB600 models and found a fairly clear double descent curve in the models with a single hidden layer and one hundred data points (Figure 8). However, we were uncertain if this could be considered a “true” double descent curve because the training error did not achieve interpolation.

Figure 8: Error on single hidden layer with a sample size of 100 and termination at No New Best 600.

We then ran significantly more models to develop a larger picture of the double descent curve and determine at which point interpolation could be achieved. Even a model complexity above 500 neurons was not able to achieve interpolation given the hyper-parameters selected. This was the reason to switch to the UH HPC cluster and the PyTorch framework; perhaps using a different Neural Network Framework or additional computing power would yield different results. We coded the same study in PyTorch and ran a series of exploratory tests to identify useful hyper-parameters. We also explored the Rectified Linear Unit (ReLU) activation function (Agarap, 2018), though we chose to use the hyperbolic tangent because those tests had a lower error rate. For the PyTorch tests, we ran each hyper-parameter set seven times, producing the results in Figure 9. The results are similar to the FS framework: the training and testing error are highly correlated and the training error does not reach interpolation. It is worth noting that neither Google Colab nor the UH HPC cluster's shared nodes yielded significant performance advantages over what was available in our private cluster. Furthermore, the UH HPC shared nodes' time limitations would require the advanced pause/resume features to complete the planned tests.

Figure 9: Error on single hidden layer with a sample size of 100 and termination at 100m epochs. We had better performance with PyTorch when the data was normalized between ±1, and we only tracked the re-scaled Average Error for the testing data. This chart depicts training and testing together and thus uses the normalized values.

For both the FS and PY frameworks the critical point for the second descent occurs after a model complexity of forty. If we had reached interpolation, we could have claimed successful reproduction of the double descent curve of Belkin et al. (2018). The fact that major features of the curves were nearly identical across both frameworks indicates that the difficulty in reaching interpolation was not due to the framework but to some property of the dataset or the hyper-parameters selected.

5.2 The Very Long Study, Elusive Interpolation

In all of our tests, we never achieved interpolation (a training error of zero). We set up a PyTorch test with no maximum iteration, a learning rate of 1e-6, and a model complexity of 75 (notice the interesting dip in Figure 9 around 75-80) and allowed it to run for weeks. At 1.7 billion epochs, the normalized (±1) training error was 8.6e-7 and the testing error was 3.7e-6. In the original scale (±10k), the testing error rate was 0.067, which yields a greater than 90% accuracy. This significantly exceeds the training and testing error rates of all tests at the 1e-5 learning rate. The closest comparable tests were achieved on the FS framework with a sample size of 10k and a model complexity of 25 after 347k epochs. It is worth noting that while the FS framework can achieve comparable results in fewer epochs than PyTorch, chronologically they are similar.

5.3 Low Training vs Testing Error in Early Training

Some of our tests resulted in an insignificant Training vs Testing Error (TTE), and we sought to identify the hyper-parameter(s) that supported this. By logging the TTE after each training epoch, it became evident that if the hyper-parameter values for model complexity and sample size were not above some minimum value, then the generalization error would be higher. Furthermore, when model complexity and sample size were sufficient for the problem, the TTE was minimized long before training terminated. Sufficient sample size and model complexity also resulted in correlated training and testing errors.

Figure 10: High TTE - training epochs with ten data points and ten nodes across three layers. This is typical of an unhealthy network: interpolation will not be achieved and the TTE is high.

Figure 11: Lower TTE - training epochs with one hundred data points and twenty-five nodes across two layers. The training and testing error rates are more closely related.

Figure 12: Extremely Low TTE - training epochs with ten thousand data points and twenty-five nodes across three layers. Testing follows training almost exactly and the chart software is unable to display both (Testing is drawn on top of Training).

Figure 13: Error on ten thousand data points and twenty-five nodes across three layers. The same test as Figure 12.

Figures 10 through 12 indicate that, as the sample size increased, the TTE decreased. Figure 13 shows the TTE for the same test as Figure 12. It indicates that the variance in the TTE stabilized after a few thousand training epochs and was further reduced as training proceeded towards interpolation. In tests where training was conducted for hundreds of thousands of epochs, the TTE became ±0.01 and negligible. In these cases the sample size was greater than the sample complexity for numeric addition.

5.4 Incidental Findings with Potential Performance Impacts

After becoming familiar with the PyTorch framework, we noticed a difference between it and the FS framework. Both frameworks took similar chronological times to achieve similar results; however, the FS framework would do it in significantly fewer iterations. This is flagged as an area for future research, but as of the publication of this thesis we have not discovered the root cause(s) of this difference. We did extend the delivery date of this thesis as far as possible while performing as much testing as possible in the additional time. These tests only served to rule out certain potential causes.

6 Proposed future research

All areas of our original proposal remain open until interpolation can be achieved. If interpolation is not possible, what about numeric addition or the framework prevents this? To continue this study, we propose additional tests with smaller learning rates, no max epoch, and model complexity larger than 200. If interpolation is not achieved, the model complexity should be increased and/or the learning rate reduced. Once interpolation is achieved, the model complexity can be decreased at regular intervals until interpolation can no longer be achieved in a similar number of epochs. We would also like to continue tests with sample sizes at regular intervals below 100 (though the testing set should remain at 100). At what point does the correlation between training and testing error break? This would indicate the sample complexity for addition. Once this is identified, what impact does adjusting the model complexity have, and are model complexity and sample complexity correlated? Possibly most interesting, as a result of running the tests in both the FS and PY frameworks we identified an important difference with potentially massive performance implications. We intend to pursue this topic by forking PyTorch and systematically replacing sections of the learning algorithm that may be found to differ from the FS framework with code that matches the FS framework. Does making this change to PyTorch result in both faster learning and faster chronological training? Assuming that these changes to PyTorch result in a chronological improvement, what exact performance improvement is gained on a standard dataset such as MNIST?

6.1 Conclusion

While our original goal of identifying the sample complexity of numeric addition was not achievable given our hardware and time constraints, we did find interesting results in the early identification of insufficient data to model a problem and a potential learning improvement in training via Stochastic Gradient Descent. We also provide a series of future research projects on both topics.

6.2 Acknowledgements

I'd like to take a moment to recognize Prof. Lee Altenberg for taking me on despite his tenuous contract and us working on different islands. I'd also like to thank Prof. Kimberly Binsted for being my first advisor, as well as the other members of my committee; Prof. Ramon Figueroa-Centeno for countless hours helping me with mathematics; and Prof. H. Keith Edwards for encouraging me to go to grad school and helping me through the pre-requisites. Finally, my parents, friends, and family, whose support kept me healthy and well during one of the most difficult times of my life. Mahalo.

References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375.

Altenberg, L. (2016). Evolutionary computation. Encyclopedia of Evolutionary Biology, 2:40–47.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.

Belkin, M., Hsu, D., and Xu, J. (2019). Two models of double descent for weak features. arXiv preprint arXiv:1903.07571.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G., editors, Proceedings of COMPSTAT'2010, pages 177–186, Heidelberg. Physica-Verlag HD.

Domingos, P. (2012). A few useful things to know about machine learning. Commun. ACM, 55(10):78–87.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing, 234:11–26.

McCaffrey, J. (2012). Test Run: Neural network back-propagation for programmers. MSDN Magazine, 27(10).

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.

Niyogi, P. and Girosi, F. (1995). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Radcliffe, N. J. (1993). Genetic set recombination and its application to neural network topology optimisation. Neural Computing and Applications, 1(1):67–90.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Rumelhart, D. E. (1986). Learning internal representations by error propagation. Parallel Distributed Processing, 1:318–362.

Saranya, C. and Manikandan, G. (2013). A study on normalization techniques for privacy preserving data mining. International Journal of Engineering and Technology (IJET), 5(3):2701–2704.

Sharma, R., Nori, A. V., and Aiken, A. (2014). Bias-variance tradeoffs in program analysis. SIGPLAN Not., 49(1):127–137.

Singh, S. (2018). Understanding the bias variance tradeoff. Accessed: 2020-02-08.

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. CoRR, abs/1803.09820.

Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1:67–82.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530.
