Quick introduction to Loss Functions in Neural Networks
Most Commonly Used Deep Learning Loss Functions
Introduction
Loss functions in deep learning (or machine learning) problems play a very important role: they determine how well your model or algorithm is doing in terms of predicting the expected outcome.
The lower the value of the loss function, the better the model is performing, and vice-versa.
So, for better-performing deep learning models, we always want to minimize the loss function as much as we can.
Loss functions are often known as "Cost Functions" as well, but there is a slight difference between them: the loss function is calculated for each predicted output by comparing it with its actual value, whereas the cost function is determined by averaging all the loss function values.
Sometimes loss functions are also referred to interchangeably as "error functions", because during training we want to minimize the error on each example, and the loss function gives us the value of that error.
Mathematically, a loss function can be denoted as,
$$Loss = \lvert y_{prediction} - y_{actual} \rvert$$
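To make the formula, and the loss-versus-cost distinction above, concrete, here is a tiny sketch with made-up numbers, computing the loss for each predicted output and the cost as their average,
import numpy as np

# made-up actual and predicted values for 3 examples
y_actual = np.array([3.0, 5.0, 2.5])
y_prediction = np.array([2.5, 5.0, 4.0])

# loss: calculated for each predicted output
loss = np.abs(y_prediction - y_actual)  # [0.5, 0.0, 1.5]

# cost: the average of all the loss values
cost = loss.mean()  # ~0.667
print(loss, cost)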
What is the need?
The loss function is the metric that tells you, in practical terms, how accurate your model is. For example, suppose we have trained a neural network model and it is ready for evaluation on the test dataset: how do we express the evaluation results to show someone how well (or badly) our model is performing? A model can predict anything based on the training provided, but how accurate that prediction is, is determined by the loss function's value.
And most importantly, neural networks are trained through optimization and backpropagation, which require a loss function to calculate the model's error. So, to optimize our model while training, we need these beautiful loss functions.
Types of Loss Functions
Loss functions are categorised based on the problem we are trying to solve through a deep learning model. The most common types of problems deep learning models solve are,
Binary Classification
Multi-class Classification
Regression Problems
There are now more complex neural network models available, such as GANs and object detection models, that solve harder problems, but for this blog we'll keep things at a very high level.
Let's look into all these one by one.
Binary Classification Loss Function
Binary Classification can be defined as a model that predicts either one thing or another based on the input feature data. When we input the feature vector into the model, it gives prediction probabilities of shape (2,), which means the model is predicting between 2 classes.
Let's see some of the loss functions which are commonly used in Binary Classification,
1. Binary Cross-entropy Loss
This is the most widely used loss function for binary classification problems; it is also called Logarithmic loss.
First, the model generates a prediction probability in the range (0, 1) for the given input, and then Binary Cross-entropy compares each of these prediction probabilities (pred_prob) with the actual class output (y_true), which can be 0 or 1.
Based on that, it computes the probability that the model assigned to the actual class, which is called the corrected probability, and takes its log value.
Suppose y_true = 1 and pred_prob = 0.46; the corrected probability would then be 0.46 itself.
Likewise, for the case y_true = 0 and pred_prob = 0.82, the corrected probability would be 1 - 0.82 = 0.18.
Hence, the cross-entropy can be determined as the negative average of the logs of these corrected probabilities; written in terms of the prediction probability p_i for class 1, that means,
$$-\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(p_i) + (1-y_i) \log(1-p_i) \,\right]$$
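As a quick sanity check, here is a small sketch that reproduces this for the two example predictions above, both by hand and with tensorflow's built-in loss (the numbers come from the example, not from a real model),
import numpy as np
import tensorflow as tf

y_true = np.array([1.0, 0.0])
pred_prob = np.array([0.46, 0.82])

# corrected probability: the probability assigned to the actual class
corrected = np.where(y_true == 1, pred_prob, 1 - pred_prob)  # [0.46, 0.18]

# negative average of the logs of the corrected probabilities
print(-np.mean(np.log(corrected)))  # ~1.246

# the built-in loss gives the same value
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(y_true, pred_prob).numpy())  # ~1.246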
If we want to implement this using tensorflow, it provides tf.keras.losses.BinaryCrossentropy,
# we have to mention the loss while compiling the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer='Adam',
              metrics=['accuracy'])
Make sure that the activation function of the output layer in the network is sigmoid.
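As an illustration, a minimal (hypothetical) binary classifier could look like the sketch below; the layer sizes and input shape are placeholders, the point is the single sigmoid unit in the output layer, whose output p is the probability of class 1 (and 1 - p of class 0),
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),  # hypothetical hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')  # outputs one probability in (0, 1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer='Adam',
              metrics=['accuracy'])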
Multi-class Classification Loss Function
Multi-class classification is simply the scaled-up version of the binary classification problem, which means that instead of 2 classes (one thing or another) to predict, the model now has to be trained with more than 2 classes and make predictions over n classes.
The loss functions used for multi-class classification are,
1. Categorical & Sparse Categorical Cross-entropy
Categorical Cross-entropy and Sparse Categorical Cross-entropy operate in the same way as Binary Cross-entropy, but they are meant for multi-class classification problems.
The difference between these two is that the first accepts a one_hot representation of the target variable (the labels), whereas the latter accepts integers as labels.
For example, if we are predicting the food in a picture and the model is trained with 3 classes (pizza, steak and noodles), then with Categorical Cross-entropy the label for pizza must be represented as [1, 0, 0], whereas with Sparse Categorical Cross-entropy it is simply the integer 0.
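To illustrate, here is a small sketch comparing the two losses on a single made-up prediction for those 3 classes; both give the same value, only the label format changes,
import numpy as np
import tensorflow as tf

# made-up predicted probabilities for one picture: [pizza, steak, noodles]
pred_prob = np.array([[0.7, 0.2, 0.1]])

# Categorical Cross-entropy takes the one-hot label [1, 0, 0]
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(np.array([[1.0, 0.0, 0.0]]), pred_prob).numpy())  # ~0.357

# Sparse Categorical Cross-entropy takes the integer label 0
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce(np.array([0]), pred_prob).numpy())  # ~0.357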
In mathematical notation,
$$-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$
N = number of rows
M = number of classes
Tensorflow provides methods for both: tf.keras.losses.CategoricalCrossentropy and tf.keras.losses.SparseCategoricalCrossentropy
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),  # or tf.keras.losses.SparseCategoricalCrossentropy()
              optimizer='Adam',
              metrics=['accuracy'])
For multi-class classification problems, the softmax function is used as the activation function in the output layer.
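Sketching the 3-class food example again (with placeholder layer sizes), the output layer would have 3 softmax units, one per class,
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),  # hypothetical hidden layer
    tf.keras.layers.Dense(3, activation='softmax')  # one probability per class, summing to 1
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # integer labels
              optimizer='Adam',
              metrics=['accuracy'])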
Regression Loss Function
Regression problems refer to predicting a quantity or a value based on historical data given as an input feature vector. The predicted number could be anything, like the price of a house, stock prices, etc.
Let's see which loss functions help regression models to optimize,
1. Mean Absolute Error (MAE)
Mean absolute error loss is very easy to determine: it is the average of the absolute differences between the predicted and the actual values.
The perfect MAE value for a regression model would be 0.0, meaning the model is performing perfectly. It can be minimized through an optimization function such as Adam, which we mentioned above.
The formula for MAE,
$$\frac{1}{N} \sum_{i=1}^{N} \lvert y_{pred} - y_{true} \rvert$$
N is the number of data points
In tensorflow, there is tf.keras.losses.MeanAbsoluteError to apply this loss function in the model.
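A quick sketch with made-up values, computing MAE by hand and with the built-in loss,
import numpy as np
import tensorflow as tf

y_true = np.array([120.0, 250.0, 310.0])
y_pred = np.array([110.0, 260.0, 340.0])

# MAE by hand: average of the absolute differences
print(np.mean(np.abs(y_pred - y_true)))  # ~16.67

# the built-in loss gives the same value
mae = tf.keras.losses.MeanAbsoluteError()
print(mae(y_true, y_pred).numpy())  # ~16.67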
2. Mean Squared Error (MSE)
Similar to MAE, Mean Squared Error takes the difference between the predicted and actual values, but squares it instead of taking the absolute value.
The formula for MSE is,
$$\frac{1}{N} \sum_{i=1}^{N} (y_{pred} - y_{true})^2$$
The tf.keras.losses.MeanSquaredError method is applied as a loss function when compiling a model for training in tensorflow.
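And the same made-up values with MSE; note how squaring makes the largest error dominate,
import numpy as np
import tensorflow as tf

y_true = np.array([120.0, 250.0, 310.0])
y_pred = np.array([110.0, 260.0, 340.0])

# MSE by hand: average of the squared differences
print(np.mean((y_pred - y_true) ** 2))  # ~366.67

# the built-in loss gives the same value
mse = tf.keras.losses.MeanSquaredError()
print(mse(y_true, y_pred).numpy())  # ~366.67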
Further Reading
This blog should help you touch base with loss functions in neural networks and how to use them with the tensorflow library. But if you're interested in expanding your knowledge of loss functions in detail, specific to your problem, I would suggest reading the articles below,
Loss Functions for Training Deep Learning Neural Network, Machine Learning Mastery
Keras Loss Functions : Everything you need to know, Derrick Mwiti
Enjoy reading!