Regularization In Machine Learning
When the number of predictors (i.e. the underlying columns in your data) increases, model complexity increases along with it. More complex models also have a higher chance of overfitting.
Instead of completely “deleting” certain predictors from a model (which is equivalent to setting their coefficients to zero), wouldn’t it be interesting to just reduce the values of the coefficients to make them less sensitive to noise in the data? Penalized estimation uses parameter shrinkage to make some or all of the coefficients smaller in magnitude (closer to zero). Some penalties can additionally perform variable selection, setting some coefficients exactly equal to zero while shrinking the others. Ridge and Lasso regression are two examples of penalized estimation. There are multiple advantages to using these methods:
- They reduce model complexity
- They may help prevent overfitting
- Some of them may perform variable selection at the same time (when coefficients are set to 0)
- They can be used to counter multicollinearity
When solving a linear regression with a single predictor, you can express the cost function as

$$ \sum_{i=1}^{n}\big(y_i - (m x_i + b)\big)^2 $$

If you have multiple predictors, you would have something that looks like:

$$ \sum_{i=1}^{n}\Big(y_i - \big(\sum_{j=1}^{k} m_j x_{ij} + b\big)\Big)^2 $$

where $k$ is the number of predictors.
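To make this concrete, here is a minimal NumPy sketch of the multi-predictor cost above. The function name `linear_cost` and the toy data are made up for illustration:

```python
import numpy as np

def linear_cost(X, y, m, b):
    """Sum of squared errors for predictions X @ m + b.

    X : (n, k) array of predictors
    y : (n,) target vector
    m : (k,) coefficient vector
    b : intercept
    """
    y_hat = X @ m + b
    return np.sum((y - y_hat) ** 2)

# Hypothetical toy data: 5 observations, 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 4.0, 7.0, 9.0, 11.0])

print(linear_cost(X, y, m=np.array([1.5, 0.5]), b=0.3))
```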
Ridge regression
In ridge regression, the cost function is changed by adding a penalty term proportional to the sum of the squared magnitudes of the coefficients:

$$ \sum_{i=1}^{n}\Big(y_i - \big(\sum_{j=1}^{k} m_j x_{ij} + b\big)\Big)^2 + \lambda \sum_{j=1}^{k} m_j^2 $$

We want to minimize the cost function, so by adding this penalty term, weighted by λ, ridge regression puts a constraint on the coefficients $m_j$: large coefficients increase the objective and are therefore penalized. That’s why ridge regression leads to a shrinkage of the coefficients and helps to reduce model complexity and multicollinearity.
λ is a so-called hyperparameter, which means you have to specify its value yourself. For a small λ, the outcome of your ridge regression will resemble a plain linear regression model. For a large λ, the penalization will increase and the parameters will shrink more.
Ridge regression is often also referred to as L2 Norm Regularization.
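As a rough sketch of how this looks in practice (assuming scikit-learn and a small made-up dataset), note that scikit-learn's `Ridge` calls the penalty weight `alpha` rather than λ. Increasing `alpha` shrinks the coefficients more:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data: 30 observations, 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

# Larger alpha (scikit-learn's name for lambda) means stronger shrinkage
for alpha in [0.01, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: coefficients = {ridge.coef_.round(3)}")
```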
Lasso regression
Lasso regression is very similar to Ridge regression, except that the magnitudes of the coefficients are not squared in the penalty term. So, while Ridge regression keeps the sum of the squared regression coefficients (except for the intercept) bounded, the Lasso method bounds the sum of their absolute values.

The resulting cost function looks like this:

$$ \sum_{i=1}^{n}\Big(y_i - \big(\sum_{j=1}^{k} m_j x_{ij} + b\big)\Big)^2 + \lambda \sum_{j=1}^{k} |m_j| $$
The name “Lasso” comes from “Least Absolute Shrinkage and Selection Operator”.
While it may look similar to the definition of the Ridge estimator, the effect of the absolute values is that some coefficients might be set exactly equal to zero, while other coefficients are shrunk towards zero. Hence the Lasso method is attractive because it performs estimation and selection simultaneously, which is especially useful when the number of predictors is very high.
Lasso regression is often also referred to as L1 Norm Regularization.
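Again as a hedged sketch (scikit-learn's `Lasso` also uses `alpha` for λ, and the data below is invented), a Lasso fit on data where only two of five predictors matter illustrates how some coefficients are driven to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy data: 50 observations, 5 predictors, only the first two are relevant
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # coefficients of the irrelevant predictors typically come out as exactly 0.0
```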
Standardization before Regularization
An important step before using either Lasso or Ridge regularization is to first standardize data such that it is all on the same scale. Regularization is based on the concept of penalizing larger coefficients, so if you have features that are on different scales, some will get unfairly penalized.
A downside of standardization is that the coefficient values become less interpretable: they must be transformed back to the original scale of the features if you want to interpret how a one-unit change in a feature impacts the target variable.
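One way to handle this (a sketch assuming scikit-learn; the pipeline layout and toy data are illustrative, not the only option) is to wrap the scaler and the regularized model in a single pipeline, so standardization is always applied before fitting:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data where the two features live on very different scales
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(scale=1.0, size=50),     # feature measured in "small" units
    rng.normal(scale=1000.0, size=50),  # feature measured in "large" units
])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Standardize first so both features are penalized on an equal footing
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

# Coefficients are on the standardized scale, not the original units
print(model.named_steps["lasso"].coef_)
```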
Happy reading !!!