
Loss functions

Loss functions: How good is your neural network?

[Figure: Inputs are passed through the network to produce Outputs; the Loss compares the Outputs with the Targets.]

Inputs, e.g.:
● Image (pixel values)
● Sentence (encoded)
● State (for RL)

Targets, e.g.:
● 'Cat'/'Dog' (encoded)
● Next word (encoded)
● TD(0) (for RL)

Loss functions: Some pedantry

Loss function and cost function are often used interchangeably, but they are different:
● The loss function measures the network's performance on a single datapoint.
● The cost function is the average of the losses over the entire training dataset.

Our goal is to minimize the cost function.

In practice, we use the average loss over a (mini-)batch as a proxy for the cost function.
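A small sketch of the distinction, using illustrative names (model, X and y are placeholders, not from the slides):

import torch
import torch.nn as nn

# Per-datapoint losses, the dataset-wide cost, and a batch loss as its proxy.
torch.manual_seed(0)
model = nn.Linear(3, 1)                    # any network would do
loss_fn = nn.MSELoss(reduction="none")     # keep one loss value per datapoint

X = torch.randn(100, 3)                    # the whole (toy) training set
y = torch.randn(100, 1)

per_sample_loss = loss_fn(model(X), y)     # "loss": one value per datapoint
cost = per_sample_loss.mean()              # "cost": average over the dataset
batch_loss = per_sample_loss[:16].mean()   # what we actually optimise in practice
print(cost.item(), batch_loss.item())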

Loss/Cost functions: Some stuff to remember

● The cost function distils the performance of the network down to a single scalar number.
● Generally (although not necessarily) loss ≥ 0 (so cost ≥ 0 also).
● cost = 0 ⇒ perfect performance (on the given dataset).
● The lower the value of the cost function, the better the performance of the network.
● Hence: gradient descent.
● A cost function must faithfully represent the purpose of the network.

[Figure: an image of a cat classified as cat 0.85 / dog 0.15. Accuracy: 1, Loss (CCE): 0.07.]

Loss functions

● Classification
  ● Maximum likelihood
  ● Binary cross-entropy (aka log loss)
  ● Categorical cross-entropy
● Regression (i.e. function approximation)
  ● Squared Error
  ● Mean Absolute Error
  ● Huber Loss

Classification loss functions

● Classification
  ● Maximum likelihood
  ● Binary cross-entropy (aka log loss)
  ● Categorical cross-entropy

Classification loss functions: Maximum likelihood

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.5, .9, .2, .6, .1)
Target: (1, 1, 0, 0, 0)

Loss = -1/5 [ (1*0.5 + 0*0.5) + (1*0.9 + 0*0.1) + (0*0.2 + 1*0.8)
            + (0*0.6 + 1*0.4) + (0*0.1 + 1*0.9) ]
     = -0.7
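PyTorch has no built-in loss for this "maximum likelihood" score, so the sketch below simply computes the slide's formula directly; the tensors mirror the numbers above.

import torch

# Maximum-likelihood style score from the slide (not a built-in PyTorch loss):
# reward p where the target is 1 and (1 - p) where the target is 0,
# average over the outputs, and negate so that lower = better.
p = torch.tensor([0.5, 0.9, 0.2, 0.6, 0.1])   # network outputs
y = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])   # targets

loss = -(y * p + (1 - y) * (1 - p)).mean()
print(loss)   # tensor(-0.7000)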

Classification loss functions: Binary cross-entropy (aka log loss)
PyTorch: torch.nn.BCELoss

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.2, .4, .8, .3, .9)
Target: (0, 0, 1, 0, 1)

Loss = -1/5 [ log(1-0.2) + log(1-0.4) + log(0.8) + log(1-0.3) + log(0.9) ]
     = -1/5 [ log(0.8) + log(0.6) + log(0.8) + log(0.7) + log(0.9) ]
     = 0.28

(Each term is log(p) where the target is 1 and log(1-p) where the target is 0.)

[Figure: plot comparing y = p with y = log(p).]
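A minimal PyTorch sketch reproducing the number above with torch.nn.BCELoss (which expects probabilities; torch.nn.BCEWithLogitsLoss is the variant that takes raw logits):

import torch
import torch.nn as nn

# Binary cross-entropy on the slide's multi-label example.
p = torch.tensor([0.2, 0.4, 0.8, 0.3, 0.9])   # network outputs (probabilities)
y = torch.tensor([0.0, 0.0, 1.0, 0.0, 1.0])   # targets

loss = nn.BCELoss()(p, y)
print(loss)   # tensor(0.2839), the 0.28 from the slide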

Classification loss functions: Categorical cross-entropy
PyTorch: torch.nn.CrossEntropyLoss / torch.nn.NLLLoss

torch.nn.CrossEntropyLoss does the softmax for you.

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.8, 0, 0, .2, 0)   (must be a probability distribution)
Target: (1, 0, 0, 0, 0)

Loss = -[ 1*log(0.8) ]
     = 0.22

This loss function can also be used when you have a single binary (e.g. yes/no) output.
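A minimal sketch of the example above, assuming the output vector is treated as a probability distribution over five classes. torch.nn.NLLLoss takes log-probabilities, while torch.nn.CrossEntropyLoss takes raw logits and applies the (log-)softmax itself:

import torch
import torch.nn as nn

# Categorical cross-entropy on the slide's one-hot example.
probs = torch.tensor([[0.8, 0.0, 0.0, 0.2, 0.0]])   # a probability distribution
target = torch.tensor([0])                           # index of the '1' in (1, 0, 0, 0, 0)

loss = nn.NLLLoss()(torch.log(probs.clamp_min(1e-12)), target)
print(loss)   # tensor(0.2231), the 0.22 from the slide

# With raw (unnormalised) scores you would instead use:
# loss = nn.CrossEntropyLoss()(logits, target)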

Regression loss functions

● Regression (i.e. function approximation)
  ● Squared Error
  ● Mean Absolute Error
  ● Huber Loss

Regression loss functions: Mean squared error (aka L2 loss)
PyTorch: torch.nn.MSELoss

[Figure: the network's output for an Input is compared with a Target; the Loss is the squared difference.]

RL example:
● Input: Q(St, At), the network's value estimate for state St and action At.
● Target: reward Rt+1 + Q(St+1, At+1), for state St+1 and the [best] action At+1.
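A minimal sketch of the RL example, assuming the TD target has been computed elsewhere and is treated as a constant (the names q_pred and td_target are ours, not from the slides):

import torch
import torch.nn as nn

# Mean squared error between the network's Q-value and a fixed TD target.
q_pred = torch.tensor([1.8])        # Q(St, At) predicted by the network
td_target = torch.tensor([2.5])     # Rt+1 + Q(St+1, At+1), held fixed

loss = nn.MSELoss()(q_pred, td_target)
print(loss)   # tensor(0.4900) = (2.5 - 1.8)^2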

Regression loss functions: Absolute error (aka L1 loss)
PyTorch: torch.nn.L1Loss

[Figure: the network's output for an Input is compared with a Target; the Loss is the absolute difference.]
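The same sketch with the absolute error instead (again, the values are illustrative only):

import torch
import torch.nn as nn

# Absolute (L1) error between prediction and target.
pred = torch.tensor([1.8])
target = torch.tensor([2.5])

loss = nn.L1Loss()(pred, target)
print(loss)   # tensor(0.7000) = |2.5 - 1.8|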

Regression loss functions: Huber loss
PyTorch: torch.nn.SmoothL1Loss

Huber loss combines the best features of MSE and AE:
● like MSE for small errors;
● like AE otherwise.

This avoids over-shooting:
● MSE has a very steep gradient when the error is large, because of its quadratic form.
● AE has a steep gradient when the error is small, because it doesn't have a 'flat' bottom.

[Figure: plot of loss vs. error for the Huber loss.]
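A small sketch comparing the gradients of the three regression losses, which illustrates the point above. torch.nn.SmoothL1Loss is the slide's Huber-style loss (with default settings it matches the Huber loss with delta = 1; newer PyTorch versions also provide torch.nn.HuberLoss):

import torch
import torch.nn as nn

# Compare d(loss)/d(prediction) for a large and a small error.
losses = [("MSE", nn.MSELoss()), ("AE", nn.L1Loss()), ("Huber", nn.SmoothL1Loss())]
for err in (5.0, 0.1):
    for name, fn in losses:
        pred = torch.tensor([err], requires_grad=True)
        fn(pred, torch.zeros(1)).backward()
        print(f"error={err:<4} {name:<6} grad={pred.grad.item():.2f}")

# MSE's gradient (2*error) grows with the error; AE's stays at 1 even for tiny
# errors (no 'flat' bottom); Huber behaves like MSE near zero and like AE for
# large errors.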