
Loss functions

Loss functions: How good is your neural network?

[Figure: Inputs are passed through the network to produce Outputs; the Loss compares the Outputs with the Targets.]

Inputs, e.g.:
● Image (pixel values)
● Sentence (encoded)
● State (for RL)

Targets, e.g.:
● 'Cat'/'Dog' (encoded)
● Next word (encoded)
● TD(0) (for RL)

Loss functions: Some pedantry

Loss function and cost function are often used interchangeably, but they are different:
● The loss function measures the network's performance on a single datapoint.
● The cost function is the average of the losses over the entire training dataset.

Our goal is to minimize the cost function.

In practice, we use the average loss over a (mini-)batch as a proxy for the cost function.
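A small sketch of the distinction, using illustrative names (model, X and y are placeholders, not from the slides):

import torch
import torch.nn as nn

# Per-datapoint losses, the dataset-wide cost, and a batch loss as its proxy.
torch.manual_seed(0)
model = nn.Linear(3, 1)                    # any network would do
loss_fn = nn.MSELoss(reduction="none")     # keep one loss value per datapoint

X = torch.randn(100, 3)                    # the whole (toy) training set
y = torch.randn(100, 1)

per_sample_loss = loss_fn(model(X), y)     # "loss": one value per datapoint
cost = per_sample_loss.mean()              # "cost": average over the dataset
batch_loss = per_sample_loss[:16].mean()   # what we actually optimise in practice
print(cost.item(), batch_loss.item())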

Loss/Cost functions: Some stuff to remember

● The cost function distils the performance of the network down to a single scalar number.
● Generally (although not necessarily) loss ≥ 0 (so cost ≥ 0 also).
● cost = 0 ⇒ perfect performance (on the given dataset).
● The lower the value of the cost function, the better the performance of the network.
● Hence: gradient descent.
● A cost function must faithfully represent the purpose of the network.

[Figure: an image of a cat classified as cat 0.85 / dog 0.15. Accuracy: 1, Loss (CCE): 0.07.]

Loss functions

● Classification
  ● Maximum likelihood
  ● Binary cross-entropy (aka log loss)
  ● Categorical cross-entropy
● Regression (i.e. function approximation)
  ● Squared Error
  ● Mean Absolute Error
  ● Huber Loss

Classification loss functions

● Classification
  ● Maximum likelihood
  ● Binary cross-entropy (aka log loss)
  ● Categorical cross-entropy

Classification loss functions: Maximum likelihood

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.5, .9, .2, .6, .1)
Target: (1, 1, 0, 0, 0)

Loss = -1/5 [ (1*0.5 + 0*0.5) + (1*0.9 + 0*0.1) + (0*0.2 + 1*0.8)
            + (0*0.6 + 1*0.4) + (0*0.1 + 1*0.9) ]
     = -0.7
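PyTorch has no built-in loss for this "maximum likelihood" score, so the sketch below simply computes the slide's formula directly; the tensors mirror the numbers above.

import torch

# Maximum-likelihood style score from the slide (not a built-in PyTorch loss):
# reward p where the target is 1 and (1 - p) where the target is 0,
# average over the outputs, and negate so that lower = better.
p = torch.tensor([0.5, 0.9, 0.2, 0.6, 0.1])   # network outputs
y = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])   # targets

loss = -(y * p + (1 - y) * (1 - p)).mean()
print(loss)   # tensor(-0.7000)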

Classification loss functions: Binary cross-entropy (aka log loss)
PyTorch: torch.nn.BCELoss

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.2, .4, .8, .3, .9)
Target: (0, 0, 1, 0, 1)

Loss = -1/5 [ log(1-0.2) + log(1-0.4) + log(0.8) + log(1-0.3) + log(0.9) ]
     = -1/5 [ log(0.8) + log(0.6) + log(0.8) + log(0.7) + log(0.9) ]
     = 0.28

(Each term is log(p) where the target is 1 and log(1-p) where the target is 0.)

[Figure: plot comparing y = p with y = log(p).]
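A minimal PyTorch sketch reproducing the number above with torch.nn.BCELoss (which expects probabilities; torch.nn.BCEWithLogitsLoss is the variant that takes raw logits):

import torch
import torch.nn as nn

# Binary cross-entropy on the slide's multi-label example.
p = torch.tensor([0.2, 0.4, 0.8, 0.3, 0.9])   # network outputs (probabilities)
y = torch.tensor([0.0, 0.0, 1.0, 0.0, 1.0])   # targets

loss = nn.BCELoss()(p, y)
print(loss)   # tensor(0.2839), the 0.28 from the slide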

Classification loss functions: Categorical cross-entropy
PyTorch: torch.nn.CrossEntropyLoss / torch.nn.NLLLoss

torch.nn.CrossEntropyLoss does the softmax for you.

[Figure: an input image is passed through the network, which produces one output per class label.]

Network output: (.8, 0, 0, .2, 0)   (must be a probability distribution)
Target: (1, 0, 0, 0, 0)

Loss = -[ 1*log(0.8) ]
     = 0.22

This loss function can also be used when you have a single binary (e.g. yes/no) output.
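A minimal sketch of the example above, assuming the output vector is treated as a probability distribution over five classes. torch.nn.NLLLoss takes log-probabilities, while torch.nn.CrossEntropyLoss takes raw logits and applies the (log-)softmax itself:

import torch
import torch.nn as nn

# Categorical cross-entropy on the slide's one-hot example.
probs = torch.tensor([[0.8, 0.0, 0.0, 0.2, 0.0]])   # a probability distribution
target = torch.tensor([0])                           # index of the '1' in (1, 0, 0, 0, 0)

loss = nn.NLLLoss()(torch.log(probs.clamp_min(1e-12)), target)
print(loss)   # tensor(0.2231), the 0.22 from the slide

# With raw (unnormalised) scores you would instead use:
# loss = nn.CrossEntropyLoss()(logits, target)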

Regression loss functions

● Regression (i.e. function approximation)
  ● Squared Error
  ● Mean Absolute Error
  ● Huber Loss

Regression loss functions: Mean squared error (aka L2 loss)
PyTorch: torch.nn.MSELoss

[Figure: the network's output for an Input is compared with a Target; the Loss is the squared difference.]

RL example:
● Input: Q(St, At), the network's value estimate for state St and action At.
● Target: reward Rt+1 + Q(St+1, At+1), for state St+1 and the [best] action At+1.
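A minimal sketch of the RL example, assuming the TD target has been computed elsewhere and is treated as a constant (the names q_pred and td_target are ours, not from the slides):

import torch
import torch.nn as nn

# Mean squared error between the network's Q-value and a fixed TD target.
q_pred = torch.tensor([1.8])        # Q(St, At) predicted by the network
td_target = torch.tensor([2.5])     # Rt+1 + Q(St+1, At+1), held fixed

loss = nn.MSELoss()(q_pred, td_target)
print(loss)   # tensor(0.4900) = (2.5 - 1.8)^2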

Regression loss functions: Absolute error (aka L1 loss)
PyTorch: torch.nn.L1Loss

[Figure: the network's output for an Input is compared with a Target; the Loss is the absolute difference.]
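The same sketch with the absolute error instead (again, the values are illustrative only):

import torch
import torch.nn as nn

# Absolute (L1) error between prediction and target.
pred = torch.tensor([1.8])
target = torch.tensor([2.5])

loss = nn.L1Loss()(pred, target)
print(loss)   # tensor(0.7000) = |2.5 - 1.8|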

Regression loss functions: Huber loss
PyTorch: torch.nn.SmoothL1Loss

Huber loss combines the best features of MSE and AE:
● like MSE for small errors;
● like AE otherwise.

This avoids over-shooting:
● MSE has a very steep gradient when the error is large, because of its quadratic form.
● AE has a steep gradient when the error is small, because it doesn't have a 'flat' bottom.

[Figure: plot of loss vs. error for the Huber loss.]
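A small sketch comparing the gradients of the three regression losses, which illustrates the point above. torch.nn.SmoothL1Loss is the slide's Huber-style loss (with default settings it matches the Huber loss with delta = 1; newer PyTorch versions also provide torch.nn.HuberLoss):

import torch
import torch.nn as nn

# Compare d(loss)/d(prediction) for a large and a small error.
losses = [("MSE", nn.MSELoss()), ("AE", nn.L1Loss()), ("Huber", nn.SmoothL1Loss())]
for err in (5.0, 0.1):
    for name, fn in losses:
        pred = torch.tensor([err], requires_grad=True)
        fn(pred, torch.zeros(1)).backward()
        print(f"error={err:<4} {name:<6} grad={pred.grad.item():.2f}")

# MSE's gradient (2*error) grows with the error; AE's stays at 1 even for tiny
# errors (no 'flat' bottom); Huber behaves like MSE near zero and like AE for
# large errors.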