Intuition — Why L1 Regularization Pushes Coefficients to 0

Joseph Gatto
3 min read · Mar 23, 2021

I always hear the general, hand-wavy statement that “Ridge Regression pushes coefficients towards zero, but Lasso Regression actually zeros out coefficients”… but why? Looking at the math didn’t make this statement immediately obvious to me, so I wanted to write out an explicit example.

Suppose we have the following dataset, where I will refer to NQ as neighborhood quality, SQ as school quality, and HP as housing price.
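For concreteness, imagine values like these (hypothetical numbers, chosen so that SQ closely tracks NQ and HP is exactly twice NQ):

| NQ | SQ  | HP |
|----|-----|----|
| 3  | 2.9 | 6  |
| 5  | 5.2 | 10 |
| 8  | 7.8 | 16 |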

We quickly notice that we can predict the price of a house perfectly with the function HP = 2×NQ. That is, we don’t even need to know anything about school quality.

When formulating a regression for this problem, we will want parameters α and β (and an intercept b, but let’s ignore the intercept for now) such that we can compute HP = α×NQ + β×SQ.

Our MSE error function will thus be

E(α, β) = (1/N) Σᵢ (HPᵢ − (α×NQᵢ + β×SQᵢ))²

If we use L1 regularization, this turns into LASSO regression and our error function changes to

E(α, β) = (1/N) Σᵢ (HPᵢ − (α×NQᵢ + β×SQᵢ))² + λ(|α| + |β|)

where λ controls the strength of the regularization.

For the sake of clarity, suppose our parameters α and β can only take discrete values. In this case, there are three reasonable settings of (α, β): either (α=1, β=1), (α=2, β=0), or (α=0, β=2). These are what I call the “reasonable” cases because each one gets us either the exact mapping from NQ and SQ to HP or very close to it. Any other choice of integer parameter values would place us very far away from the target.

All three of these settings give the same L1 penalty, since |1| + |1| = |2| + |0| = |0| + |2| = 2. The regularization term is therefore a tie, so our optimizer will select whichever values minimize the first term of E(α, β), which in this case is α=2, β=0, since we know HP = 2×NQ exactly.
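To make this concrete, here is a minimal sketch that evaluates the LASSO objective for each of the three candidates, using the hypothetical table values from above and an arbitrary λ = 0.1:

```python
import numpy as np

# Hypothetical data from the table above: SQ closely tracks NQ,
# and HP is exactly twice NQ.
NQ = np.array([3.0, 5.0, 8.0])
SQ = np.array([2.9, 5.2, 7.8])
HP = np.array([6.0, 10.0, 16.0])

lam = 0.1  # arbitrary regularization strength

def lasso_objective(alpha, beta):
    # MSE data term plus the L1 penalty
    mse = np.mean((HP - (alpha * NQ + beta * SQ)) ** 2)
    return mse + lam * (abs(alpha) + abs(beta))

for alpha, beta in [(1, 1), (2, 0), (0, 2)]:
    print(f"(α={alpha}, β={beta}) -> E = {lasso_objective(alpha, beta):.4f}")
```

Since the L1 penalty is identical (λ × 2) for all three candidates, the data term decides, and (α=2, β=0) comes out lowest.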

As we can see, LASSO performed implicit feature selection via the L1 term. This is just one example of when LASSO may zero out a parameter. Now notice how, when we use L2 regularization in the same context, we find a different result.

Consider the Ridge version of our cost function:

E(α, β) = (1/N) Σᵢ (HPᵢ − (α×NQᵢ + β×SQᵢ))² + λ(α² + β²)

Now, in this situation, (α=1, β=1) minimizes the regularization term among our three candidates, since 1² + 1² = 2, whereas 2² + 0² = 0² + 2² = 4. This highlights how Ridge encourages parameter values of a smaller magnitude rather than zeroing any one of them out.
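For comparison, here is the same enumeration with the Ridge penalty swapped in (same hypothetical data and arbitrary λ as before):

```python
import numpy as np

# Same hypothetical data as in the LASSO sketch above.
NQ = np.array([3.0, 5.0, 8.0])
SQ = np.array([2.9, 5.2, 7.8])
HP = np.array([6.0, 10.0, 16.0])

lam = 0.1  # same arbitrary regularization strength

def ridge_objective(alpha, beta):
    # MSE data term plus the L2 penalty
    mse = np.mean((HP - (alpha * NQ + beta * SQ)) ** 2)
    return mse + lam * (alpha ** 2 + beta ** 2)

for alpha, beta in [(1, 1), (2, 0), (0, 2)]:
    print(f"(α={alpha}, β={beta}) -> E = {ridge_objective(alpha, beta):.4f}")
```

Now the penalty term is no longer tied: (α=1, β=1) carries a penalty of 2 while the sparse candidates carry 4, so with this λ the split solution gives the lowest total even though (α=2, β=0) fits the data perfectly.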

There is a great discussion about the sparsity of LASSO here. I would like to add a quote from this post that I think nicely summarizes the general behavior of the L1 vs L2 regularizer.

“Notice that for L1, the gradient (of the L1 term) is either 1 or -1, except for when w=0. That means that L1-regularization will move any weight towards 0 with the same step size, regardless of the weight’s value. In contrast, you can see that the L2 gradient is linearly decreasing towards 0 as the weight goes towards 0. Therefore, L2-regularization will also move any weight towards 0, but it will take smaller and smaller steps as a weight approaches 0.”
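To put the quote in symbols: the L1 penalty λ|w| has gradient λ·sign(w) for w ≠ 0, so every step pushes w towards zero by the same fixed amount regardless of how small w already is; the L2 penalty λw² has gradient 2λw, so the push towards zero shrinks in proportion to w itself and, in continuous terms, never quite reaches it.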

As we can see from this quote, it is of course possible for the gradients of either penalty to push a weight to zero; it is just far more likely in the context of L1 regularization.
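To see this behavior end to end, here is a minimal sketch using scikit-learn’s Lasso and Ridge on hypothetical data of the same shape as our example, where SQ is a slightly noisy copy of NQ and HP is exactly twice NQ (the regularization strengths are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: SQ is a nearly redundant, noisy copy of NQ,
# and HP is exactly twice NQ.
NQ = rng.uniform(1, 10, size=100)
SQ = NQ + rng.normal(0, 0.01, size=100)
HP = 2 * NQ

X = np.column_stack([NQ, SQ])

lasso = Lasso(alpha=0.1).fit(X, HP)
ridge = Ridge(alpha=0.1).fit(X, HP)

print("LASSO coefficients:", lasso.coef_)  # SQ weight driven to exactly 0
print("Ridge coefficients:", ridge.coef_)  # weight split across NQ and SQ
```

On a run like this, LASSO typically reports something close to (2, 0), zeroing out the redundant feature, while Ridge reports two nonzero coefficients of similar size that together approximate the same fit.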
