<<

Artificial Neural Network: a tool for approximating complex functions

To cite this version: Aboubacar Konaté. Artificial Neural Network: a tool for approximating complex functions. 2019. hal-02075977.

HAL Id: hal-02075977, https://hal.archives-ouvertes.fr/hal-02075977. Preprint submitted on 15 Apr 2019.


Aboubacar Konaté∗

This paper is a mathematical introduction to Artificial Neural Networks (ANN). We show how they are used as a method for approximating functions. We will see how they are used to solve problems in machine learning, such as face recognition or handwriting recognition, and how they are used to solve partial differential equations.

1 Introduction

In machine learning, a supervised learning problem can be formulated as follows. Let $(x_i, y_i)_{1 \le i \le N}$ be a dataset where $x_i \in \mathbb{R}^d$ is an input, $y_i$ an output and $N$ the number of training data. The goal consists in finding a map $f$ such that $f(x_i) \approx y_i$ for all $i$. In other terms, we are looking for a map that links the input $x_i$ to the output $y_i$. The above problem is called regression when the output $y_i$ is real and classification when the output $y_i$ is discrete. For example, $x_i$ can be a list of symptoms of a patient and $y_i$ indicates whether he suffers from a disease or not.

To solve this problem, several methods have been proposed in the literature ([Kon18b]). Among them is the Support Vector Machine (SVM) method. It is an optimization problem whose standard form can be formulated as follows: we look for a linear function $f(x) = \theta^T x + \theta_0$ such that
$$\min_{\theta} \frac{1}{2}\|\theta\|^2 \quad \text{subject to} \quad y_i(\theta^T x_i + \theta_0) \ge 1 \quad \text{for all } i = 1, \ldots, N.$$
The SVM method is a very good method, successfully applied in real-world applications such as face recognition, medical diagnosis, etc. However, it suffers from some drawbacks. First, it is intractable for very high-dimensional datasets. Another difficulty with the SVM method lies in the fact that, in some applications, it is not easy to find which features or inputs $x_i$ to choose. For example, how should one choose the most relevant symptoms to characterize a disease? To overcome this difficulty, the Artificial Neural Network method is introduced.
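For concreteness, the linear SVM above can be fit in a few lines with scikit-learn; the snippet below is a minimal illustrative sketch (the synthetic two-cloud dataset and the parameter C are assumptions, not part of the paper).

```python
# Minimal linear SVM sketch with scikit-learn (illustrative synthetic data).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two Gaussian clouds as a toy binary classification dataset.
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(+1.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0)            # C controls the margin/violation trade-off
clf.fit(X, y)
print("theta  =", clf.coef_)       # the linear weights theta
print("theta0 =", clf.intercept_)  # the bias theta_0
```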

Artificial Neural Networks are a very powerful tool for approximating complex functions. They have been successfully applied to computer vision problems ([SHM+16, SSS16, CXG+16]), system identification and control ([ZXL15]), finance ([Fre17]), cancer diagnosis ([GVRP10, BDH+97, ALCP16, LAM+16]), the prediction of foundation settlements ([DBT18]) and many other applications ([NM17, NM18, oAoANNiH00a, oAoANNiH00b, PIC+15, DRN13, ECC05]).

∗Laboratoire Jacques-Louis Lions, Université Pierre et Marie Curie

Figure 1: A cartoon drawing of a biological neuron (left) and its mathematical model (right).

Recently, Artificial neural networks have been used to solve partial differential equations [WCC+18, CCE+18, WY18, BN18].

This manuscript is inspired by the Stanford course (http://vision.stanford.edu/teaching/cs231n/). There is nothing new in this paper. Our goal consists only in showing how to use Artificial Neural Networks in practice through examples.

2 Mathematical description of artificial neural networks

2.1 Biological neural systems

The field of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems; it has since become a matter of engineering and of achieving good results in Machine Learning tasks.

The basic computational unit of the brain is a neuron. There are approximately 86 billion neurons, connected to each other by approximately $10^{14}$–$10^{15}$ synapses. Figure 1 shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). Each neuron receives input signals from its dendrites and produces output signals along its (single) axon. The axon eventually branches out and connects via synapses to dendrites of other neurons. In the computational model of a neuron, the signals that travel along the axons (e.g. $x_0$) interact multiplicatively (e.g. $w_0 x_0$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $w_0$). The idea is that the synaptic strengths (the weights $w$) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed. If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $f$, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the sigmoid function $\sigma$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1. We will see details of these activation functions later in this section.


Figure 2: Example of a neural network with 3 layers (input layer, hidden layer, output layer).

2.2 Mathematical description of Artificial Neural Networks

A mathematical description of Artificial Neural Networks is given in this section. We first remark that ANNs are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. More precisely, Artificial Neural Networks receive an input (a single vector) and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the "output layer" and in classification settings it represents the class scores (see an illustration in Figure 2). We consider here, as an example, an ANN with $N_L + 1$ layers. We start by defining the following quantities:

$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l, \qquad \forall\, l = 1, \cdots, N_L - 1, N_L \qquad (1)$$
$$a_j^l = \sigma\!\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = \sigma(z_j^l), \qquad \forall\, l = 1, \cdots, N_L - 1, N_L \qquad (2)$$
where $a_j^0 = x_j$ ($x_j$ is the $j$th input) and $w_{jk}^l$ is the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. Similarly, we define $b_j^l$ as the $j$th bias in the $l$th layer. We can rewrite the above quantities in vector form:

$$z^l = w^l a^{l-1} + b^l, \qquad \forall\, l = 1, \cdots, N_L - 1, N_L \qquad (3)$$
$$a^l = \sigma(z^l), \qquad \forall\, l = 1, \cdots, N_L - 1, N_L. \qquad (4)$$
Every activation function $\sigma$ (or non-linearity) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice:

Sigmoid. The sigmoid non-linearity has the mathematical form $\sigma(x) = 1/(1 + e^{-x})$.

Tanh. It squashes a real-valued number to the range $[-1, 1]$. Note that the tanh neuron is simply a scaled sigmoid neuron; in particular, the following holds: $\tanh(x) = 2\sigma(2x) - 1$.

ReLU. The Rectified Linear Unit has become very popular in the last few years. It computes the function $f(x) = \max(0, x)$. In other words, the activation is simply thresholded at zero.
• It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
• Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
• Unfortunately, ReLU units can be fragile during training and can "die" (i.e. become neurons that never activate across the entire training dataset).

Leaky ReLU. Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when $x < 0$, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes $f(x) = \mathbf{1}_{(x<0)}(ax) + \mathbf{1}_{(x \ge 0)}(x)$ where $a$ is a small constant.

Maxout. Other types of units have been proposed that do not have the functional form $f(w^T x + b)$, where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron, which generalizes the ReLU and its leaky version. The Maxout neuron computes the function $\max(w_1^T x + b_1, w_2^T x + b_2)$. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $w_1 = 0$, $b_1 = 0$). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons, it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
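To make equations (3)-(4) and the activation functions above concrete, here is a minimal NumPy sketch of these non-linearities and of a forward pass through a small fully connected network; the layer sizes, the random parameters and the choice of activation are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                       # squashes to (-1, 1); tanh(z) = 2*sigmoid(2z) - 1

def relu(z):
    return np.maximum(0.0, z)               # thresholds at zero

def leaky_relu(z, a=0.01):
    return np.where(z < 0, a * z, z)        # small slope a for z < 0

def forward(x, weights, biases, act=sigmoid):
    """Forward pass: a^l = act(w^l a^{l-1} + b^l) for every layer, cf. equations (3)-(4)."""
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b                        # z^l = w^l a^{l-1} + b^l
        a = act(z)                           # a^l = sigma(z^l)
    return a

# Illustrative 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [rng.normal(size=4), rng.normal(size=2)]
print(forward(np.array([0.5, -1.0, 2.0]), weights, biases, act=relu))
```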

Besides a possible regularization term (discussed later), the objective contains the data loss, which in a supervised learning problem measures the compatibility between a prediction (e.g. the class scores in classification) and the ground-truth label. The data loss takes the form of an average over the data losses for every individual example. That is, $L = \frac{1}{N}\sum_i L_i$, where $N$ is the number of training data. Let us abbreviate $f = \sigma(x_i; w, b)$ to be the activations of the output layer in a Neural Network. The nature of the data loss function depends on the application we want to apply the ANN method to. However, we can list some commonly used loss functions:
• The first common choice is the Softmax classifier that uses the cross-entropy loss:
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

• Regression is the task of predicting real-valued quantities, such as the price of houses or the length of something in an image. For this task, it is common to compute the difference between the predicted quantity and the true answer and then measure its squared L2 norm, or its L1 norm. A minimal sketch of both losses is given below.
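The sketch below implements the two losses just listed in NumPy; the example scores and targets are illustrative.

```python
import numpy as np

def softmax_cross_entropy(f, y):
    """Cross-entropy loss L_i = -log(exp(f_y) / sum_j exp(f_j)) for one example.
    f: vector of class scores, y: index of the true class."""
    f = f - np.max(f)                          # shift scores for numerical stability
    log_probs = f - np.log(np.sum(np.exp(f)))
    return -log_probs[y]

def l2_regression_loss(pred, target):
    """Squared L2 norm of the difference between prediction and true answer."""
    return np.sum((pred - target) ** 2)

print(softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), y=0))
print(l2_regression_loss(np.array([1.5, 2.0]), np.array([1.0, 2.5])))
```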

To summarize, the learning problem can be formulated as follows. Find the optimal weights $w_{op}$ and the optimal biases $b_{op}$, solutions of the following optimization problem:
$$\min_{w,b} \frac{1}{N}\sum_i L_i(x_i; w, b) = \min_{w,b} L(X; w, b). \qquad (5)$$
One big challenge with the optimization problem (5) is that the cost function is mostly non-convex, so the gradient-descent method might get stuck at local minimum points, leading to a sub-optimal solution. The non-convex nature of the Artificial Neural Network objective is the result of the hidden layer units having non-linear activation functions, such as the sigmoid.

3 Solving the optimization problem using the Backpropagation algorithm

3.1 Gradient descent method

In this section, we show how to solve the optimization problem (5). We use an iterative algorithm, that is, we look for a sequence $(w_k, b_k)_{k \ge 0}$ such that

$$L(X; w_0, b_0) \ge L(X; w_1, b_1) \ge L(X; w_2, b_2) \ge \cdots$$
To find such a sequence, one approach consists in making a local approximation by applying a Taylor expansion and then finding a point that satisfies the above inequality. To be precise, we search for $(w_k + t_1, b_k + t_2)$ such that $L(X; w_k + t_1, b_k + t_2) \le L(X; w_k, b_k)$. Using a Taylor expansion, we get
$$L(X; w_{k+1}, b_{k+1}) \approx L(X; w_k, b_k) + t_1 \frac{\partial L}{\partial w}(X; w_k, b_k) + t_2 \frac{\partial L}{\partial b}(X; w_k, b_k).$$
Let $\eta \in \,]0, 1]$ (called the step size). Then, by taking $t_1 = -\eta \frac{\partial L}{\partial w}(X; w_k, b_k)$ and $t_2 = -\eta \frac{\partial L}{\partial b}(X; w_k, b_k)$, we get
$$L(X; w_{k+1}, b_{k+1}) \approx L(X; w_k, b_k) - \eta \left|\frac{\partial L}{\partial w}(X; w_k, b_k)\right|^2 - \eta \left|\frac{\partial L}{\partial b}(X; w_k, b_k)\right|^2.$$
In other words, $L(X; w_{k+1}, b_{k+1}) \le L(X; w_k, b_k)$. Since we use an iterative method, we must stop at some step. One natural criterion is to stop when the gradient gets very small, i.e.

$$\left|\frac{\partial L}{\partial w}(X; w_k, b_k)\right|^2 + \left|\frac{\partial L}{\partial b}(X; w_k, b_k)\right|^2 \le s_c$$
where $s_c$ is a stopping criterion.

Data: Initialize $w_0$, $b_0$, the criterion $s_c$, $k = 0$, and compute
$$gradnorm = \left|\frac{\partial L}{\partial w}(X; w_0, b_0)\right|^2 + \left|\frac{\partial L}{\partial b}(X; w_0, b_0)\right|^2$$
Result: Approximation of the solution of the optimization problem
while $gradnorm > s_c$ do
1. compute the gradient $\left(\frac{\partial L}{\partial w}(X; w_k, b_k), \frac{\partial L}{\partial b}(X; w_k, b_k)\right)$
2. choose a value for $\eta$
3. $w_{k+1} = w_k - \eta \frac{\partial L}{\partial w}(X; w_k, b_k)$
4. $b_{k+1} = b_k - \eta \frac{\partial L}{\partial b}(X; w_k, b_k)$
5. compute
$$gradnorm = \left|\frac{\partial L}{\partial w}(X; w_k, b_k)\right|^2 + \left|\frac{\partial L}{\partial b}(X; w_k, b_k)\right|^2$$
6. $k = k + 1$
end
Algorithm 1: Gradient Descent Algorithm
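As an illustration of Algorithm 1, here is a minimal NumPy sketch of gradient descent on a simple quadratic loss; the loss, its gradient, the step size and the stopping criterion are illustrative choices, not the paper's setting.

```python
import numpy as np

# Illustrative quadratic loss L(w, b) = (w - 3)^2 + (b + 1)^2 and its gradient.
def loss(w, b):
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def grad(w, b):
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

eta, s_c = 0.1, 1e-8           # step size and stopping criterion
w, b, k = 0.0, 0.0, 0
gw, gb = grad(w, b)
while gw ** 2 + gb ** 2 > s_c:
    w -= eta * gw               # w_{k+1} = w_k - eta * dL/dw
    b -= eta * gb               # b_{k+1} = b_k - eta * dL/db
    gw, gb = grad(w, b)
    k += 1
print(k, w, b, loss(w, b))      # converges towards the minimum (3, -1)
```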

Initialization of the parameters

Before we can begin to train the network, we have to initialize its parameters. Since, usually, the loss function is not convex, there is no guarantee of converging to the optimal solution.

A reasonable-sounding idea might be to set all the initial weights to zero, which we expect to be the "best guess" in expectation. This turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same. Therefore, we still want the weights to be very close to zero but, as we have argued above, not identically zero. As a solution, it is common to initialize the weights of the neurons to small random numbers and to refer to doing so as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.
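A minimal sketch of such a random, symmetry-breaking initialization in NumPy; the 0.01 scale and the layer sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]     # illustrative architecture: 3 inputs, 4 hidden neurons, 2 outputs

# Small random weights break the symmetry between neurons; biases may start at zero.
weights = [0.01 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [np.zeros(n_out) for n_out in layer_sizes[1:]]

print([w.shape for w in weights])   # [(4, 3), (2, 4)]
```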

Choice of the step size η

When the objective function $L$ is strongly nonlinear, the first-order Taylor expansion is a good approximation only in a very small region. Because of that, the step size $\eta$ must be carefully chosen. To overcome this, several methods have been proposed:
1. A natural choice for the step size $\eta$ is to take it as the solution of the following optimization problem:
$$\min_{\eta} L\!\left(X;\; w_k - \eta \frac{\partial L}{\partial w}(X; w_k, b_k),\; b_k - \eta \frac{\partial L}{\partial b}(X; w_k, b_k)\right).$$

The drawback of this method lies in the fact that, in some applications, evaluating $L\!\left(X;\; w_k - \eta \frac{\partial L}{\partial w}(X; w_k, b_k),\; b_k - \eta \frac{\partial L}{\partial b}(X; w_k, b_k)\right)$ for a fixed $\eta$ is very expensive in terms of computing time.
2. AdagradOptimizer,
3. Momentum,

4. RMSprop,
5. AdamOptimizer,
6. AdadeltaOptimizer.
For more details about these methods, one can consult [Pat17, Gér17]; a short usage sketch is given below.
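The names above correspond to optimizers available in TensorFlow/Keras. Assuming TensorFlow 2.x (suggested by the optimizer names and by the tutorial cited in Section 5.2), a few Adam steps on the same toy quadratic loss might look like this illustrative sketch; it is not the paper's code.

```python
import tensorflow as tf

w = tf.Variable(0.0)
b = tf.Variable(0.0)
opt = tf.keras.optimizers.Adam(learning_rate=0.1)   # could also be SGD, RMSprop, Adagrad, Adadelta

for _ in range(500):
    with tf.GradientTape() as tape:
        loss = (w - 3.0) ** 2 + (b + 1.0) ** 2       # illustrative quadratic loss
    grads = tape.gradient(loss, [w, b])
    opt.apply_gradients(zip(grads, [w, b]))

print(w.numpy(), b.numpy())                           # approaches (3, -1)
```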

3.2 Backpropagation algorithm

We have previously seen that solving the optimization problem (5) with the Gradient Descent algorithm requires the computation of the gradient. To this end, we will use the Backpropagation algorithm ([HM94]).

Recall that
$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l, \qquad \forall\, l = 1, \cdots, N_L - 1, N_L$$
$$a_j^l = \sigma\!\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = \sigma(z_j^l), \qquad \forall\, l = 1, \cdots, N_L - 1, N_L.$$

In vector form:
$$z^l = w^l a^{l-1} + b^l, \qquad \forall\, l = 1, \cdots, N_L - 1, N_L$$
$$a^l = \sigma(z^l), \qquad \forall\, l = 1, \cdots, N_L - 1, N_L.$$
The idea for computing $\frac{\partial L}{\partial w_{jk}^l}$ and $\frac{\partial L}{\partial b_j^l}$ consists in using the chain rule. We begin with the output layer:
$$\frac{\partial L}{\partial w_{jk}^L} = \frac{\partial L}{\partial z_j^L}\frac{\partial z_j^L}{\partial w_{jk}^L} = \frac{\partial L}{\partial z_j^L}\, a_k^{L-1}.$$
We define
$$\delta_j^L := \frac{\partial L}{\partial z_j^L} = \frac{\partial L}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L} = \frac{\partial L}{\partial a_j^L}\, \sigma'(z_j^L).$$
Then
$$\frac{\partial L}{\partial w_{jk}^L} = \frac{\partial L}{\partial z_j^L}\frac{\partial z_j^L}{\partial w_{jk}^L} = \frac{\partial L}{\partial a_j^L}\, \sigma'(z_j^L)\, a_k^{L-1}. \qquad (6)$$
Similarly, we have
$$\frac{\partial L}{\partial b_j^L} = \frac{\partial L}{\partial z_j^L}\frac{\partial z_j^L}{\partial b_j^L} = \delta_j^L. \qquad (7)$$
We now compute $\frac{\partial L}{\partial w_{jk}^l}$ and $\frac{\partial L}{\partial b_j^l}$ for $l = 1, 2, \cdots, N_L - 1$. As before, we have

$$\frac{\partial L}{\partial w_{jk}^l} = \frac{\partial L}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^l} = \frac{\partial L}{\partial z_j^l}\, a_k^{l-1}.$$
We define
$$\delta_j^l := \frac{\partial L}{\partial z_j^l} = \sum_k \frac{\partial L}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\, \frac{\partial z_k^{l+1}}{\partial z_j^l}.$$
To evaluate the last factor, note that
$$z_k^{l+1} = \sum_j w_{kj}^{l+1} a_j^l + b_k^{l+1} = \sum_j w_{kj}^{l+1}\, \sigma(z_j^l) + b_k^{l+1}.$$
Differentiating, we obtain
$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1}\, \sigma'(z_j^l).$$
Thus,
$$\delta_j^l = \sum_k \delta_k^{l+1}\, w_{kj}^{l+1}\, \sigma'(z_j^l).$$
Finally, we get:
$$\frac{\partial L}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1}.$$
An analogous expression gives the rate of change of the cost with respect to any bias of the network:
$$\frac{\partial L}{\partial b_j^l} = \frac{\partial L}{\partial z_j^l}\frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l. \qquad (8)$$

Here is the full algorithm for computing the gradient of the loss function:

Data: a set of training examples
Result: the gradient of the loss function
For each training example $x$: set the corresponding input activation $a^{x,0}$ and perform the following steps:
for $l = 1, 2, 3, \cdots, N_L$ do
$$z^{x,l} = w^l a^{x,l-1} + b^l, \quad \text{and} \quad a^{x,l} = \sigma(z^{x,l}) \qquad (9)$$
end
Compute $\delta^{x,L} = \nabla_a L \odot \sigma'(z^{x,L})$;
for $l = N_L - 1, N_L - 2, \cdots, 1$ do
$$\delta^{x,l} = \left((w^{l+1})^T \delta^{x,l+1}\right) \odot \sigma'(z^{x,l}). \qquad (10)$$
end
for $l = N_L, \cdots, 1$ do
update the weights according to the rule:
$$w^l \leftarrow w^l - \frac{\eta}{n}\sum_x \delta^{x,l} \left(a^{x,l-1}\right)^T \qquad (11)$$
and the biases according to the rule:
$$b^l \leftarrow b^l - \frac{\eta}{n}\sum_x \delta^{x,l}. \qquad (12)$$
end
Algorithm 2: Method for computing the gradient of the loss function
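As an illustration of Algorithm 2, here is a minimal NumPy sketch of the forward/backward pass for a single training example, using sigmoid activations and a squared-error loss; both choices, as well as the 3-4-2 architecture, are illustrative assumptions and this is not the author's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return dL/dw^l and dL/db^l for one example, with L = 0.5 * ||a^L - y||^2."""
    # Forward pass: store z^l and a^l for every layer (eq. 9).
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_w[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta
    # Backward pass: delta^l = (w^{l+1})^T delta^{l+1} * sigma'(z^l)  (eq. 10).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = np.outer(delta, activations[-l - 1])
        grad_b[-l] = delta
    return grad_w, grad_b

# Illustrative 3-4-2 network.
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((4, 3)), 0.01 * rng.standard_normal((2, 4))]
biases  = [np.zeros(4), np.zeros(2)]
gw, gb = backprop(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0]), weights, biases)
print([g.shape for g in gw])   # [(4, 3), (2, 4)]
```

The gradients returned for a mini-batch of examples can then be averaged and used in the update rules (11)-(12).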

Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points. There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

1. L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective.

2. L1 regularization is another relatively common form of regularization.
3. Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and to use projected gradient descent to enforce the constraint.
4. Dropout (Figure 3) is an extremely effective, simple and recently introduced regularization technique. It is implemented by only keeping a neuron active with some probability $p$ (a hyperparameter), or setting it to zero otherwise. A minimal sketch of L2 regularization and dropout is given below.
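As an illustration of items 1 and 4, here is a minimal NumPy sketch of an L2 penalty term and of (inverted) dropout applied to a layer's activations; the regularization strength and the keep probability p are illustrative hyperparameters.

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    """L2 regularization: lam * sum of squared weights, added to the data loss."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(a, p=0.5, training=True):
    """Inverted dropout: keep each activation with probability p during training,
    scaled by 1/p so that expected activations match those at test time."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) < p) / p
    return a * mask

a = np.array([0.2, 1.5, -0.3, 0.8])
print(dropout(a, p=0.5))
print(l2_penalty([np.ones((2, 2))], lam=0.01))
```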

4 Convolutional Neural Networks (ConvNet)

Convolutional Neural Networks are very similar to the ordinary Neural Networks of the previous section. A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture (see Figure 6).

Figure 3: A graphic illustration of Dropout.

4.1 Components of Convolutional Neural Networks

The following are the typical components of a convolutional neural network:
• Input layer will hold the pixel intensities of the image. For example, an input image with width 64, height 64, and depth 3 for the Red, Green, and Blue color channels (RGB) would have input dimensions of 64 × 64 × 3.
• Convolution layer will take images from the preceding layers and convolve them with the specified number of filters to create images called output feature maps. The number of output feature maps is equal to the specified number of filters. Till now, CNNs in TensorFlow have used mostly 2D filters; however, recently 3D convolution filters have been introduced.
• Activation functions for CNNs are generally ReLUs, which we discussed previously. The output dimension is the same as the input after passing through the ReLU activation layers. The ReLU layer adds non-linearity to the network and at the same time provides non-saturating gradients for positive net inputs.
• Pooling layer will downsample the 2D activation maps along the height and width dimensions. The depth or the number of activation maps is not compromised and remains the same.

• Fully connected layers contain traditional neurons that receive different sets of weights from the preceding layers; there is no weight sharing between them as is typical for convolution operations. Each neuron in this layer will be connected either to all the neurons in the previous layer or to all the coordinate-wise outputs in the output maps through separate weights. In classification settings, the last fully connected layer represents the class scores.

Some of the terms with which one should be familiar while defining the convolution layer are as follows (a short code sketch is given after this list):
• Filter size – The filter size defines the height and width of the filter kernel. A filter kernel of size 3 × 3 would have nine weights. Generally, these filters are initialized and slid over the input image for convolution without flipping them. Technically, when convolution is performed without flipping the filter kernel it is called cross-correlation and not convolution. However, it does not matter, as we can consider the learned filters as a flipped version of image-processing filters.
• Stride – The stride determines the number of pixels to move in each spatial direction while performing convolution. In the normal convolution of signals, we generally do not skip any pixels and instead compute the convolution sum at each pixel location, and hence we have a stride of 1 along both spatial directions for 2D signals. However, one may choose to skip every alternate pixel location while convolving, and hence have a stride of 2.
• Padding – When we convolve an image of a specific size by a filter, the resulting image is generally smaller than the original image. For example, if we convolve a 5 × 5 2D image by a filter of size 3 × 3, the resulting image is 3 × 3. Padding is an approach that appends zeroes to the boundary of an image to control the size of the output of the convolution.

Figure 4: Graphic illustration of a convolutional operation (an image I convolved with a kernel K gives I ∗ K).
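To connect these terms to code, here is a minimal sketch of a convolution layer in Keras (the assumption that TensorFlow/Keras is used follows from the optimizer names and the tutorial cited in Section 5.2); the number of filters, kernel size, stride and padding values are illustrative.

```python
import tensorflow as tf

# A single convolution layer: 16 filters of size 3x3, stride 1, zero padding
# ("same" keeps the 64x64 spatial size), followed by ReLU and 2x2 max pooling.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),                        # 64 x 64 RGB image
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3),
                           strides=(1, 1), padding="same",
                           activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),  # downsample to 32 x 32
])
model.summary()   # 16 output feature maps, one per filter
```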

4.2 Convolutional Layer

The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting. When dealing with high-dimensional inputs such as images, as we saw above, it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume (see Figure 4). The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently, this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: the connections are local in space (along width and height), but always full along the entire depth of the input volume.

4.3 Pooling Layer

A pooling operation on an image generally summarizes a locality of an image, the locality being given by the size of the filter kernel, also called the receptive field. The summarization generally happens in the form of max pooling or average pooling. In max pooling, the maximum pixel intensity of a locality is taken as the representative of that locality (see Figure 5). In average pooling, the average of the pixel intensities around a locality is taken as the representative of that locality. Pooling reduces the spatial dimensions of an image. The kernel size that determines the locality is generally chosen as 2 × 2, whereas the stride is chosen as 2.
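A minimal NumPy sketch of 2 × 2 max pooling with stride 2 on a small single-channel image; the input values are illustrative.

```python
import numpy as np

def max_pool2d(img, k=2, stride=2):
    """2D max pooling: each k x k block is replaced by its maximum value."""
    h, w = img.shape
    out_h, out_w = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = img[i * stride:i * stride + k,
                            j * stride:j * stride + k].max()
    return out

img = np.array([[9, 2, 9, 6],
                [5, 0, 9, 3],
                [0, 7, 0, 0],
                [7, 9, 3, 5]], dtype=float)
print(max_pool2d(img))   # [[9. 9.] [9. 5.]]
```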

5 Examples

5.1 Example 1: Solving a partial differential equation

We will see, in this example, how to use an Artificial Neural Network to solve a partial differential equation.


Figure 5: Graphic illustration of a Max pooling operation.

Figure 6: Example of convolutional network architecture

Let us consider the Poisson equation with Dirichlet boundary condition
$$-\Delta u = f \quad \text{in } \Omega,$$
$$u = g_D \quad \text{on } \partial\Omega,$$
where the domain $\Omega$ is the unit square $[0, 1]^2$ and $f$ is the right-hand side. This equation is used to model, for example, the displacement in solid mechanics [Kon18a, Kon17].

The algorithm for solving this equation is as follows: we approximate $u$ with

$$u_h(x, y; w_1, w_2) = A(x, y; w_1) + B(x, y) \cdot N(x, y; w_2)$$
where $A(x, y; w_1)$ is a neural network that approximates the boundary condition, $B(x, y)$ satisfies $B(x, y)|_{\partial\Omega} = 0$, and $N(x, y; w_2)$ is a neural network that approximates the solution inside the domain $\Omega$. The data set is generated randomly, both on the boundary and in the inner domain. For every update, we randomly generate points $(x_i, y_i)_{i=1}^{M}$, $(x_i, y_i) \in \partial\Omega$, and compute $g_i = g_D(x_i, y_i)$. These data are then used for computing the loss and its derivatives. The boundary loss function is defined as follows:

$$L_b = \sum_{i=1}^{M} \left(A(x_i, y_i; w_1) - g_i\right)^2.$$
Similarly, for every update, we randomly generate points $(x_i, y_i)_{i=1}^{M}$, $(x_i, y_i) \in \Omega$, and compute $f_i = f(x_i, y_i)$. The data are then used for computing the loss and derivatives. The inner loss function is defined as follows:

$$L = \sum_{i=1}^{M} \left(\Delta u_h(x_i, y_i; w_1, w_2) + f_i\right)^2.$$
To evaluate the effectiveness of our method, we report the following error metric, which can be used as an indicator (when the exact solution is available):
$$L_2 = \sqrt{\frac{1}{M}\sum_{i=1}^{M} \left|u_h(x_i, y_i; w_1, w_2) - u(x_i, y_i)\right|^2}.$$

We consider the analytical solution

$$u(x, y) = \sin(\pi x)\sin(\pi y), \qquad (x, y) \in \Omega,$$
then we have
$$f(x, y) = -\Delta u(x, y) = 2\pi^2 \sin(\pi x)\sin(\pi y)$$
and the boundary condition is the zero boundary condition (that is, $g_D(x, y) = 0$). The evolution of the numerical solution is represented in Figure 7.
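A minimal TensorFlow sketch of this construction for the test case above (zero boundary data, so A ≡ 0, and B(x, y) = x(1−x)y(1−y) is one admissible choice of cut-off function); the network size, the sampling and the training loop are illustrative assumptions, not the author's exact setup.

```python
import numpy as np
import tensorflow as tf

# Small fully connected network N(x, y; w2) approximating the solution inside Omega.
N = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(1),
])

def trial_solution(xy):
    """u_h = A + B * N with A = 0 (zero Dirichlet data) and B vanishing on the boundary."""
    x, y = xy[:, 0:1], xy[:, 1:2]
    B = x * (1.0 - x) * y * (1.0 - y)
    return B * N(xy)

def f(xy):
    x, y = xy[:, 0:1], xy[:, 1:2]
    return 2.0 * np.pi ** 2 * tf.sin(np.pi * x) * tf.sin(np.pi * y)   # -Delta u = f

opt = tf.keras.optimizers.Adam(1e-3)
for step in range(2000):
    xy = tf.random.uniform((256, 2))                 # random collocation points in [0,1]^2
    with tf.GradientTape() as loss_tape:
        with tf.GradientTape() as t2:
            t2.watch(xy)
            with tf.GradientTape() as t1:
                t1.watch(xy)
                u = trial_solution(xy)
            grad_u = t1.gradient(u, xy)              # (u_x, u_y)
        hess = t2.batch_jacobian(grad_u, xy)         # second derivatives per point
        lap_u = hess[:, 0, 0] + hess[:, 1, 1]        # Laplacian of u_h
        loss = tf.reduce_mean((lap_u + tf.squeeze(f(xy))) ** 2)   # PDE residual loss
    grads = loss_tape.gradient(loss, N.trainable_variables)
    opt.apply_gradients(zip(grads, N.trainable_variables))
```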

5.2 Example 2: Fashion items classification

We consider here an example of a classification problem introduced in https://www.tensorflow.org/tutorials/keras/basic_classification. The training data (https://github.com/zalandoresearch/fashion-mnist) contains 70,000 grayscale images in 10 categories, each of size 28 × 28 (see Figure 8), classified as:
• 0: T-shirt/top
• 1: Trouser
• 2: Pullover
• 3: Dress
• 4: Coat
• 5: Sandal
• 6: Shirt
• 7: Sneaker
• 8: Bag
• 9: Ankle boot
Some results are represented in Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13; a minimal training sketch is given below.
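The sketch below is in the spirit of the cited tutorial and corresponds to the simplest fully connected model with one hidden layer; the layer sizes, optimizer and number of epochs are illustrative and need not match the exact configurations behind Figures 9-13.

```python
import tensorflow as tf

# Load Fashion-MNIST: 60,000 training and 10,000 test 28x28 grayscale images.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel intensities to [0, 1]

# Fully connected network with one hidden layer.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # class scores for the 10 categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)
test_loss, test_acc = model.evaluate(x_test, y_test)
print("test accuracy:", test_acc)
```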

6 Conclusion

We have presented in this paper a description of Artificial Neural Networks and how they are used in practice. In particular, we used them to solve a partial differential equation.

In the future, we will carry out a deeper theoretical study, such as error estimation, and we will make a comparison between the ANN approximation and standard methods such as the finite element method or the finite volume method.

Figure 7: Evolution of the Artificial Neural Network approximation. Errors of the successive panels: (a) 0.05967, (b) 0.04973, (c) 0.02228, (d) 0.01105, (e) 0.00477, (f) 0.00140, (g) 0.00078, (h) 0.00073, (i) 0.00052, (j) 0.00052.

Figure 8: Example of some items.

Figure 9: Some predictions with a neural network with only input and output layer. The efficiency is 0.8395.

Figure 10: Some predictions with a neural network with only one hidden layer. The efficiency is 0.8733.

Figure 11: Some predictions with a convolutional neural network with only input and output layer. The efficiency is 0.9067.

Figure 12: Some predictions with a convolutional neural network with only one hidden layer. The efficiency is 0.9032.

Figure 13: Some predictions with a convolutional neural network with two hidden layers. The efficiency is 0.8931.

References

[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[ALCP16] Elaheh Alizadeh, Samanthe Merrick Lyons, Jordan Marie Castle, and Ashok Prasad. Measuring systematic changes in invasive cancer cell shape using Zernike moments. Integrative Biology, 8(11):1183–1193, 2016.
[BDH+97] Leonardo Bottaci, Philip J Drew, John E Hartley, Matthew B Hadfield, Ridzuan Farouk, Peter WR Lee, Iain MC Macintyre, Graeme S Duthie, and John RT Monson. Artificial neural networks applied to outcome prediction for colorectal cancer patients in separate institutions. The Lancet, 350(9076):469–472, 1997.
[BN18] Jens Berg and Kaj Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.

[CCE+18] Siu Wun Cheung, Eric T Chung, Yalchin Efendiev, Eduardo Gildin, and Yating Wang. Deep global model reduction learning. arXiv preprint arXiv:1807.09335, 2018.
[CXG+16] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016.
[DBT18] E Díaz, Vicente Brotons, and Roberto Tomás. Use of artificial neural networks to predict 3-D elastic settlement of foundations on soils with inclined bedrock. Soils and Foundations, 2018.
[DRN13] GS Dwarakish, Shetty Rakshith, and Usha Natesan. Review on applications of neural network in coastal engineering. Artificial Intelligent Systems and Machine Learning, 5(7):324–331, 2013.
[ECC05] Leonardo Ermini, Filippo Catani, and Nicola Casagli. Artificial neural networks applied to landslide susceptibility assessment. Geomorphology, 66(1-4):327–343, 2005.

[Fre17] Jordan French. The time traveller's CAPM. Investment Analysts Journal, 46(2):81–96, 2017.
[Gér17] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
[GVRP10] N Ganesan, K Venkatesh, MA Rama, and A Malathi Palani. Application of neural networks in diagnosing cancer disease using demographic data. International Journal of Computer Applications, 1(26):76–85, 2010.
[HM94] Martin T Hagan and Mohammad B Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.

[Kon17] Aboubacar Konaté. Méthode multi-échelle pour la simulation d'écoulements miscibles en milieux poreux. PhD thesis, Paris 6, 2017.
[Kon18a] Aboubacar Konaté. A multiscale method for a Convection-Diffusion equation. February 2018.
[Kon18b] Aboubacar Konaté. Un aperçu sur quelques méthodes en apprentissage automatique supervisé. Working paper or preprint, December 2018.
[LAM+16] Samanthe M Lyons, Elaheh Alizadeh, Joshua Mannheimer, Katherine Schuamberg, Jordan Castle, Bryce Schroder, Philip Turk, Douglas Thamm, and Ashok Prasad. Changes in cell shape are correlated with metastatic potential in murine and human osteosarcomas. Biology Open, 5(3):289–299, 2016.
[NM17] Mohammad Amin Nabian and Hadi Meidani. Deep learning for accelerated reliability analysis of infrastructure networks. CoRR, abs/1708.08551, 2017.
[NM18] Mohammad Amin Nabian and Hadi Meidani. Accelerating stochastic assessment of post-earthquake transportation network connectivity via machine-learning-based surrogates. Technical report, 2018.
[oAoANNiH00a] ASCE Task Committee on Application of Artificial Neural Networks in Hydrology. Artificial neural networks in hydrology. I: Preliminary concepts. Journal of Hydrologic Engineering, 5(2):115–123, 2000.
[oAoANNiH00b] ASCE Task Committee on Application of Artificial Neural Networks in Hydrology. Artificial neural networks in hydrology. II: Hydrologic applications. Journal of Hydrologic Engineering, 5(2):124–137, 2000.
[Pat17] S. Pattanayak. Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python. Apress, 2017.
[PIC+15] DJ Peres, C Iuppa, L Cavallaro, A Cancelliere, and E Foti. Significant wave height record extension by neural networks and reanalysis wind data. Ocean Modelling, 94:128–140, 2015.
[SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[SSS16] Nandini Sengupta, Md Sahidullah, and Goutam Saha. Lung sound classification using cepstral-based statistical features. Computers in Biology and Medicine, 75:118–129, 2016.
[WCC+18] Yating Wang, Siu Wun Cheung, Eric T Chung, Yalchin Efendiev, and Min Wang. Deep multiscale model learning. arXiv preprint arXiv:1806.04830, 2018.
[WY18] E Weinan and Bing Yu. The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[ZXL15] Dimitrios Zissis, Elias K Xidias, and Dimitrios Lekkas. A cloud based architecture capable of perceiving and predicting multiple vessel behaviour. Applied Soft Computing, 35:652–661, 2015.
