ElasticNet Regularization on Linear Models
In my coding bootcamp we discussed the bias-variance tradeoff in linear regression and two methods for regularizing a model. These methods, Lasso and Ridge regularization, are ways to improve our model. In this blog I’ll go over an extension that combines the two into one method. This method is called Elastic Net regularization, and it attempts to capture the best of both approaches.
The Bias-Variance Tradeoff

Generally, when we’re building and optimizing a model we are trying to strike a balance between making the best predictions we can from our training data and making sure we don’t overfit that data. An overfit model tends to exhibit high variance: it fits the training data very well but is likely to miss on the test data. In other words, there is large variance in how the model performs on test data, because small changes to the input data can cause wild swings in the output. An underfit model tends to exhibit low variance but high bias: it won’t change much in response to small changes in the predictor variables, but it is likely to have consistently large errors relative to the actual results.
Our goal when building predictive models is to find a level of complexity that minimizes the combination of bias and variance. When building a large linear regression model with many inputs, we often find that some of the predictor variables are statistically significant yet have relatively small coefficients. This looks like a high-variance model, and it is ripe for improvement if we can reduce that variance by making the model simpler.
The two original methods for doing this are Lasso and Ridge regularization. I’ll go through these briefly before looking at Elastic Net regularization, a combination of the two.
Lasso Regularization

Lasso regularization is an extension of OLS that adds a penalty on the absolute size of the coefficients, shrinking those that provide only a small amount of predictive power. The penalty is controlled by lambda: at 0 the model looks like regular OLS, and as lambda goes to infinity all coefficients are driven to 0, resulting in a constant function. Lasso models can be helpful in reducing the number of predictor variables because they tend to set weak coefficients exactly to 0.
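For reference, a standard way to write the Lasso objective (with n observations, p coefficients, and lambda controlling the penalty) is:

\hat{\beta}^{lasso} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|

The first term is the usual OLS sum of squared errors; the second is the lambda-scaled penalty on the absolute values of the coefficients.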
Ridge Regularization

Ridge regularization is similar to Lasso in that it also adds a penalty term, scaled by lambda, to the OLS objective. Unlike Lasso, the Ridge penalty uses the sum of the squared coefficient values before scaling by lambda. As with Lasso, lambda determines how the model behaves: at 0 it looks like regular OLS, and as lambda approaches infinity the model again approaches a constant function. A major difference is that Ridge regularization shrinks coefficients toward zero but does not set them exactly to zero, so it is not particularly helpful for feature selection.
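The Ridge objective has the same form, except the penalty term sums the squared coefficients rather than their absolute values:

\hat{\beta}^{ridge} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2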
Elastic Net Regularization

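Combining the two penalties, the Elastic Net objective can be written as:

\hat{\beta}^{enet} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda_1 \sum_{j=1}^{p} \left| \beta_j \right| + \lambda_2 \sum_{j=1}^{p} \beta_j^2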
Although the format of the equation looks slightly different, we can see that Elastic Net regularization is simply minimizing the OLS error plus both the Lasso and Ridge penalty terms. We now have a lambda 1 and a lambda 2, representing the Lasso and Ridge portions respectively. As before, if both lambdas are set to 0 we have an OLS model, and as they both go to infinity we approach a constant function.
Generally, Elastic Net regularization has been shown to perform at least as well as Lasso in most situations while still retaining Lasso’s feature selection capability, which Ridge regularization lacks. There is an added cost in having to tune two lambda values, but we can work through that with built-in scikit-learn tools, as sketched below.
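Here is a minimal sketch of that tuning step using scikit-learn’s ElasticNetCV on toy data from make_regression. Note that scikit-learn does not take two separate lambdas directly; it parameterizes the penalties with an overall strength (alpha) and a mixing weight between the L1 and L2 terms (l1_ratio), and cross-validation searches over both.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Toy data standing in for a real feature matrix and target.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cross-validate over a grid of overall penalty strengths (alpha) and
# L1/L2 mixes (l1_ratio); together these play the role of lambda 1 and lambda 2.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5)
model.fit(X_train, y_train)

print("best alpha:", model.alpha_)
print("best l1_ratio:", model.l1_ratio_)
print("test R^2:", model.score(X_test, y_test))
```

With the cross-validated values in hand, the fitted model can be inspected like any other linear model, and coefficients that Elastic Net has driven to zero drop out of the feature set.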