Activation Functions In Deep Learning
What is a Neural Network?
Neural Networks form the backbone of deep learning. The goal of a neural network is to approximate an unknown function. It is made up of interconnected neurons, each with weights and a bias that are updated during training based on the error. The activation function applies a nonlinear transformation to the linear combination of inputs, which then generates the neuron's output. The combination of activated neurons produces the network's output.
Process of a Neural Network:
The basic process carried out by a neuron in a neural network is: take the inputs, multiply them by the neuron's weights, and add the bias. Feed the result, x, to the activation function f(x). Take the output and transmit it to the next layer of neurons.
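As a minimal sketch of this process, the NumPy snippet below implements a single neuron with a sigmoid activation. The input, weight, and bias values are made up purely for illustration:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def neuron_forward(inputs, weights, bias):
    # Step 1: weighted sum of the inputs plus the bias
    x = np.dot(weights, inputs) + bias
    # Step 2: pass the result through the activation function
    return sigmoid(x)

# Example values (chosen arbitrarily for illustration)
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.7, -0.2])
bias = 0.1
print(neuron_forward(inputs, weights, bias))  # a value between 0 and 1
```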
What is an Activation Function?
An activation function decides whether a neuron should be activated or not by calculating the weighted sum of its inputs and adding the bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Activation Functions can be divided into 3 types:
- Binary Step Function
- Linear Activation Function
- Non Linear Activation Function
Binary Step Activation Function:
A Binary Step Activation Function is a threshold-based activation function. If the value of y is above a certain threshold, the neuron is declared activated; if it is below the threshold, it is not activated.
In other words, A = 1 if y > threshold, and A = 0 otherwise.
With a threshold of 0, the function outputs a 1 (activated) whenever the input is greater than 0 and outputs a 0 otherwise.
The problem with the Step Function is that it does not allow multi-value outputs, so it cannot be used for tasks that need more than two output classes.
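A quick sketch of a binary step activation; the threshold of 0 is an assumption chosen for illustration:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Outputs 1 where the input exceeds the threshold, 0 elsewhere
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # [0 0 0 1 1]
```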
Linear Activation Function:
A Linear Activation Function is a straight-line function where the activation is proportional to the input. It takes the form A = cx, where c is a constant.
It takes the inputs, multiplies them by the weights for each neuron, and creates an output signal proportional to the input. The Linear Function is better than the Step Function because it allows multiple output values.
Although the Linear Function allows multiple outputs, it has a few problems:
- The Linear Function does not work well with backpropagation: its derivative is a constant, so the gradient does not depend on the input.
- No matter how many layers we have in our neural network, the last layer will be a linear function of the first layer, because a composition of linear functions is itself linear, as the sketch below demonstrates.
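To see why stacking layers with a linear activation collapses into a single linear transformation, here is a small NumPy check; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear (identity) activation
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both linear layers
hidden = W1 @ x + b1
out_two_layers = W2 @ hidden + b2

# The same mapping expressed as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
out_one_layer = W_combined @ x + b_combined

print(np.allclose(out_two_layers, out_one_layer))  # True
```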
Non Linear Activation Function:
Most Neural Network models these days use Non Linear Activation Functions. These functions make it possible to create complex mappings between inputs and outputs, which is required for learning and modeling complex data such as images and text.
These functions allow backpropagation because their derivatives depend on the input, so the gradient carries useful information for updating the weights.
Here are five common Non Linear Activation Functions:
Sigmoid Activation Function:
The Sigmoid Function is a probabilistic approach towards decision making, and its output ranges between 0 and 1. The equation for the Sigmoid Function is f(x) = 1 / (1 + e^(-x)).
The main problem with the Sigmoid Function is 'vanishing gradients'. Vanishing gradients occur because large inputs are squashed into the range between 0 and 1, so the derivative becomes very small in those regions and the weight updates during training become negligible.
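A short sketch of the sigmoid and its derivative, showing how the gradient shrinks toward zero for large positive or negative inputs; the sample input values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# The gradient peaks at 0.25 near x = 0 and is almost 0 at |x| = 10
```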
TanH Activation Function:
The TanH Function ranges between -1 and +1, and is in fact a scaled and shifted version of the Sigmoid Function. The equation for TanH is f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), which can also be written as 2 * sigmoid(2x) - 1.
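A small check of the relationship between TanH and the Sigmoid, using NumPy's built-in tanh:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```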
RELU (Rectified Linear Unit):
RELU is the most popular activation function. The equation of RELU is f(x) = max(0, x).
RELU is computationally efficient. Although it looks like a Linear Function, it is non-linear and works with backpropagation. The problem with RELU is that when inputs are negative, the gradient of the function becomes zero, so the affected neurons stop updating and cannot learn. This problem is called 'The Dying RELU'.
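A minimal sketch of RELU and its gradient, illustrating the zero gradient for negative inputs; the sample values are chosen for illustration:

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```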
Leaky RELU:
Leaky RELU has a small positive slope in the negative region, so it enables backpropagation even for negative input values and thereby solves the dying RELU problem. The equation of Leaky RELU is f(x) = x for x > 0 and f(x) = ax otherwise, where a is a small constant such as 0.01.
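A short sketch of Leaky RELU; the slope of 0.01 for negative inputs is a common default, not a value specified in the article:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Keeps positive values unchanged and scales negative values by alpha
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.5    3.   ]
```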
Soft Max:
Soft Max is a generalization of the Sigmoid Function to a multi-class setting. It is popularly used in the final layer of multi-class classification. Soft Max converts a vector of scores into probabilities proportional to the exponential of each score, so the outputs always sum to one: softmax(x_i) = e^(x_i) / sum_j e^(x_j).
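A minimal sketch of Soft Max with the usual max-subtraction trick for numerical stability; the sample scores are arbitrary:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; it does not change the result
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```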
Happy reading !!!