Loss functions
How good is your neural network?
Inputs, e.g.:
● Image (pixel values)
● Sentence (encoded)
● State (for RL)

Outputs + Targets → Loss

Targets, e.g.:
● ‘Cat’/‘Dog’ (encoded)
● Next word (encoded)
● TD(0) target (for RL)

Loss functions: some pedantry
Loss function and cost function are often used interchangeably, but they are different:
● The loss function measures the network's performance on a single datapoint.
● The cost function is the average of the losses over the entire training dataset.
Our goal is to minimise the cost function. In practice, we use batch losses as a proxy for the cost function.
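The distinction can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the squared-error loss here is an arbitrary choice):

```python
def loss(prediction, target):
    # Loss: performance on a single datapoint (squared error, chosen arbitrarily).
    return (prediction - target) ** 2

def cost(predictions, targets):
    # Cost: average of the per-datapoint losses over the whole dataset.
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)

# In practice the same average over a mini-batch stands in for the cost.
print(cost([0.9, 0.2, 0.4], [1.0, 0.0, 0.0]))
```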
Loss/cost functions: some things to remember
● The cost function distils the performance of the network down to a single scalar number.
● Generally (although not necessarily) loss ≥ 0 (so cost ≥ 0 also).
● cost = 0 ⇒ perfect performance (on the given dataset).
● The lower the value of the cost function, the better the performance of the network (hence gradient descent).
● A cost function must faithfully represent the purpose of the network.
Example: output (cat: 0.85, dog: 0.15), true class ‘cat’ → Accuracy: 1; Loss (CCE): 0.07 (using a base-10 log; ≈ 0.16 with the natural log).
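That example can be reproduced directly (a sketch; accuracy only checks the argmax, while cross-entropy rewards how much probability the true class received):

```python
import math

probs = {"cat": 0.85, "dog": 0.15}  # network output after softmax
true_class = "cat"

# Accuracy: 1 if the highest-probability class matches the true class.
accuracy = 1 if max(probs, key=probs.get) == true_class else 0

# Categorical cross-entropy: -log of the probability given to the true class.
cce_log10 = -math.log10(probs[true_class])  # ~0.07, as on the slide
cce_ln = -math.log(probs[true_class])       # ~0.16 with the natural log

print(accuracy, round(cce_log10, 2), round(cce_ln, 2))
```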
Loss functions
● Classification
  ○ Maximum likelihood
  ○ Binary cross-entropy (aka log loss)
  ○ Categorical cross-entropy
● Regression (i.e. function approximation)
  ○ Mean Squared Error
  ○ Mean Absolute Error
  ○ Huber Loss
Classification loss functions
Classification loss functions: Maximum likelihood

[Figure: bar chart of the network's output over the five classes]

Output: (.5, .9, .2, .6, .1)
Target: (1, 1, 0, 0, 0)

Loss = −1/5 [ (1×0.5 + 0×0.5) + (1×0.9 + 0×0.1) + (0×0.2 + 1×0.8) + (0×0.6 + 1×0.4) + (0×0.1 + 1×0.9) ]
     = −0.7
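The calculation above can be written out as a short sketch (plain Python, mirroring the slide's arithmetic):

```python
# Maximum-likelihood loss: score p for a target of 1, (1 - p) for a target of 0,
# then average and negate (so that better predictions give a lower loss).
outputs = [0.5, 0.9, 0.2, 0.6, 0.1]
targets = [1, 1, 0, 0, 0]

terms = [t * p + (1 - t) * (1 - p) for p, t in zip(outputs, targets)]
loss = -sum(terms) / len(terms)
print(round(loss, 2))  # -0.7
```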
Classification loss functions: Binary cross-entropy, aka log loss (torch.nn.BCELoss)

[Figure: bar chart of the network's output over the five classes; plot of y = log(p)]

Output: (.2, .4, .8, .3, .9)
Target: (0, 0, 1, 0, 1)

For a target of 1 the term is log(p); for a target of 0 it is log(1 − p):

Loss = −1/5 [ log(1−0.2) + log(1−0.4) + log(0.8) + log(1−0.3) + log(0.9) ]
     = −1/5 [ log(0.8) + log(0.6) + log(0.8) + log(0.7) + log(0.9) ]
     ≈ 0.28
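The same calculation in plain Python (a sketch of what torch.nn.BCELoss computes with its default mean reduction):

```python
import math

outputs = [0.2, 0.4, 0.8, 0.3, 0.9]
targets = [0, 0, 1, 0, 1]

# For target 1 the term is log(p); for target 0 it is log(1 - p).
terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
         for p, t in zip(outputs, targets)]
bce = -sum(terms) / len(terms)
print(round(bce, 2))  # 0.28
```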
Classification loss functions: Categorical cross-entropy (torch.nn.CrossEntropyLoss, torch.nn.NLLLoss)

CrossEntropyLoss does the softmax for you (NLLLoss expects log-probabilities instead).

Output: (.8, 0, 0, .2, 0)   ← must be a probability distribution
Target: (1, 0, 0, 0, 0)

Loss = −[ 1×log(0.8) ]
     ≈ 0.22

This loss function can also be used when you have a single binary (e.g. yes/no) output.
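In plain Python the slide's number falls out of a single term (a sketch; note that torch.nn.CrossEntropyLoss actually takes raw logits and applies the softmax internally, whereas this sketch starts from probabilities as the slide does):

```python
import math

outputs = [0.8, 0.0, 0.0, 0.2, 0.0]  # must already be a probability distribution
target = [1, 0, 0, 0, 0]             # one-hot: the true class is index 0

# Only the true class contributes: loss = -log(p_true).
cce = -math.log(outputs[target.index(1)])
print(round(cce, 2))  # 0.22
```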
Regression loss functions
● Regression (i.e. function approximation)
  ○ Mean Squared Error
  ○ Mean Absolute Error
  ○ Huber Loss
Regression loss functions: Mean squared error, aka L2 loss (torch.nn.MSELoss)

Loss = average of (output − target)²

Example from Q-learning:
● Input: Q(Sₜ, Aₜ), for state Sₜ and action Aₜ
● Target: reward Rₜ₊₁ + Q(Sₜ₊₁, Aₜ₊₁), for state Sₜ₊₁ and [best] action Aₜ₊₁
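A sketch of that TD-style use of MSE with made-up numbers (the discount factor gamma is my addition; the slide writes the target without it):

```python
# TD target for Q-learning: reward plus (discounted) value of the best next action.
gamma = 0.99        # discount factor (assumed; not shown on the slide)
q_current = 1.5     # Q(S_t, A_t): the network's prediction (made-up value)
reward = 1.0        # R_{t+1} (made-up value)
q_next_best = 2.0   # Q(S_{t+1}, A_{t+1}) for the best next action (made-up value)

target = reward + gamma * q_next_best
mse = (q_current - target) ** 2  # squared error for this single transition
print(round(mse, 4))
```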
Regression loss functions: Mean absolute error, aka L1 loss (torch.nn.L1Loss)

Loss = average of |output − target|
Regression loss functions: Huber loss (torch.nn.SmoothL1Loss)

Huber loss combines the best features of MSE and absolute error:
● like MSE for small errors;
● like absolute error otherwise.

This avoids over-shooting:
● MSE has a very steep gradient when the error is large (i.e. for outliers) because of its quadratic form.
● Absolute error has a steep gradient when the error is small because it doesn’t have a ‘flat’ bottom.

[Figure: loss vs error for MSE, absolute error and Huber loss]
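The three regression losses can be compared side by side on a single error value (a sketch; with delta = 1 this Huber form matches what torch.nn.SmoothL1Loss computes with its default beta):

```python
def mse(error):
    # Quadratic everywhere: very steep for large errors.
    return error ** 2

def mae(error):
    # Constant slope everywhere: no 'flat bottom' near zero.
    return abs(error)

def huber(error, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond: the best of both.
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

for e in (0.1, 5.0):
    print(e, mse(e), mae(e), huber(e))
```

For the small error (0.1) Huber behaves like MSE; for the large error (5.0) it grows only linearly, like absolute error, so outliers pull less hard on the gradient.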