Subset selection, regularization, and shrinkage

hrothgar · November 21, 2011

FWIW, here's an amusing little post on feature selection techniques. Stay tuned for next week, when I compare sequential feature selection to ridge regression and the lasso. (For those who care, I am shamelessly stealing examples from Tibshirani. In my defense, I do include footnotes, and this is much more about showing how to get stuff done in MATLAB code than breaking new ground.)

http://blogs.mathworks.com/loren/

hrothgar · December 6, 2011

Second posting is now available

(This piece introduces a newish regression algorithm called "Lasso" which offers some significant advantages compared to traditional linear regression)

http://blogs.mathworks.com/loren/2011/11/29/subset-selection-and-regularization-part-2/

jdeegan · December 6, 2011

:P ex transpose ex inverse, ex transpose y. I have been puzzling over a, thus far mysterious, ridge regression package included in my software for years. What are the alleged advantages of this revolutionary technique?

You gotta luv Krugman. Humble country school teacher comes from nowhere to be the Rush Limbaugh of the left. Ain't America grand.

WellSpyder · December 6, 2011

I have been puzzling over a, thus far mysterious, ridge regression package included in my software for years. What are the alleged advantages of this revolutionary technique?

I have to confess that despite being a practicing economist I haven't come across ridge regression or lasso before (my thanks to hrothgar for posting these links). But I'm well aware of the dangers of "over-fitting" when estimating equations using standard regression techniques that then don't prove very useful for forecasting - for example, including too many lags because they appear to be significant when looking at t-stats, etc. Putting a premium on parsimony therefore seems an attractive idea. Doing it by "forcing" coefficients towards zero doesn't necessarily seem an intuitive way of doing it, but I can see advantages in this, too, eg in the cases where traditional estimation can often produce a series of lags with opposite signs, which even though they can be significant in helping to fit the data over the estimation period seem unlikely to be able to help in predicting the future.

hrothgar · December 6, 2011

The following presumes that you've read stuff on lasso...

Linear regression identifies a set of coefficients that minimizes the sum of the squared errors between predicted and actual values.

Lasso changes this minimization problem. We identify a set of coefficients that minimizes the sum of the squared errors plus a penalty proportional to the sum of the absolute values of the regression coefficients. (We're using an L1 norm.)

Ridge regression (aka Tikhonov regularization) is the same as lasso except we substitute an L2 norm for the L1 norm. This time around we identify a set of coefficients that minimizes the sum of the squared errors plus a penalty proportional to the sum of the squares of the coefficients. As usual, the math is a lot easier with an L2 norm, which is why Tikhonov solved this problem a long time before lasso was a twinkle in Tibshirani's eye...
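
For reference, the three minimization problems can be written side by side. The penalty weight \lambda \ge 0 and the row notation x_i are assumptions added here for illustration; the posts in this thread leave the weight implicit.

```latex
\begin{align*}
\text{Least squares:} \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 \\
\text{Lasso (L1):}    \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert \\
\text{Ridge (L2):}    \quad & \min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
\end{align*}
```

Setting \lambda = 0 recovers ordinary least squares in both penalized forms.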

As for motivation:

1. The predictive accuracy of linear regression models suffers dramatically if you have relatively wide data sets with strong correlation between your independent variables.

2. Regularization techniques like ridge regression and lasso are often able to significantly improve predictive accuracy (at the cost of increasing your bias)

3. Lasso and ridge regression differ in the choice of the norm. The L1 norm will cause the lasso to quickly drive individual regression coefficients completely to zero, thereby acting as a feature selection technique. The L2 norm used by ridge shrinks coefficients without zeroing them, so it will preserve larger numbers of independent variables within the model.

4. There is also something known as an elastic net, whose penalty is a convex combination of the ridge and lasso penalties and which offers many of the best properties of both.
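
To make point 3 concrete, here is a minimal pure-Python sketch, not the MATLAB code from the blog posts. It fits lasso and ridge by cyclic coordinate descent on a tiny made-up data set. The function names, the toy data, the penalty weight lam = 2.0, and the choice of coordinate descent as the solver are all assumptions for illustration; there is no intercept and no standardization.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=50):
    """Minimize SSE + lam * penalty(beta) one coefficient at a time.

    penalty is "l1" (lasso) or "l2" (ridge). Hypothetical helper, no intercept.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Inner product of column j with the residual that leaves column j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            if penalty == "l1":
                # The L1 update can land exactly on zero.
                beta[j] = soft_threshold(rho, lam / 2.0) / z
            elif penalty == "l2":
                # The L2 update only rescales; it never hits zero unless rho is zero.
                beta[j] = rho / (z + lam)
            else:
                raise ValueError("penalty must be 'l1' or 'l2'")
    return beta


# Toy data (assumed): a strong first predictor and a weak second one.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.0 * a + 0.2 * b for a, b in X]

ols = coordinate_descent(X, y, lam=0.0, penalty="l2")    # no penalty
lasso = coordinate_descent(X, y, lam=2.0, penalty="l1")
ridge = coordinate_descent(X, y, lam=2.0, penalty="l2")

print("least squares:", ols)
print("lasso:        ", lasso)
print("ridge:        ", ridge)
```

On this toy data the lasso fit puts exactly zero on the weak second predictor, while ridge shrinks both coefficients and keeps both in the model.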

S2000magic · December 6, 2011

1. The predictive accuracy of linear regression models suffers dramatically if you have relatively wide data sets with strong correlation between your independent variables.

So lasso and ridge overcome the problem of multicollinearity?

hrothgar · December 6, 2011

So lasso and ridge overcome the problem of multicollinearity?

Much of the time, yes. However, you're decreasing variance by increasing bias

hrothgar · December 6, 2011

Putting a premium on parsimony therefore seems an attractive idea. Doing it by "forcing" coefficients towards zero doesn't necessarily seem an intuitive way of doing it, but I can see advantages in this, too, eg in the cases where traditional estimation can often produce a series of lags with opposite signs, which even though they can be significant in helping to fit the data over the estimation period seem unlikely to be able to help in predicting the future.

Here's an intuitive explanation that might help.

Assume that you have a linear model where Y = f(X1, X2, ... XN) + noise vector

Furthermore, let's assume that one of these variables is a linear function of another.

If you run your regression, the program will probably throw some warning about a rank deficient matrix, the reason being that you can't estimate unique values for these two coefficients. Any pair of coefficients that produces the same combined effect fits the data equally well.

Now perturb one of your observations by epsilon so that you no longer have this whole "rank deficiency" issue. Your regression is going to run perfectly fine. However, there's a catch... Relatively minor changes to your noise vector are going to cause enormous swings in your regression coefficients for the two correlated variables. One time they'll be sitting at (+500, +800), the next at (-15, -24), the time after that at (-2500, -4000). If you want to believe that these coefficients have some real world meaning, this behavior is really annoying.

Adding in the regularization term penalizes solutions that are far removed from zero and makes the entire process much more stable.
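
The thought experiment above can be sketched in a few lines of pure Python. Everything here is an assumption made for illustration: the helper name fit_two_predictors, the data, the perturbation eps = 1e-3, the two hand-picked noise vectors, and the ridge weight lam = 0.1. The model has two predictors and no intercept, so the normal equations are a 2x2 system solved by Cramer's rule.

```python
def fit_two_predictors(x1, x2, y, lam=0.0):
    """Solve (X'X + lam*I) beta = X'y for a two-column X with no intercept.

    lam = 0 is ordinary least squares; lam > 0 is ridge regression.
    """
    a11 = sum(u * u for u in x1) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    a22 = sum(v * v for v in x2) + lam
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


eps = 1e-3
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.6 * u for u in x1]   # exactly collinear with x1 ...
x2[-1] += eps                # ... until one observation is nudged by eps

signal = [u + v for u, v in zip(x1, x2)]        # "true" coefficients are (1, 1)
noise_a = [0.01, -0.01, 0.01, -0.01, 0.01]      # two tiny noise vectors
noise_b = [-n for n in noise_a]
y_a = [s + n for s, n in zip(signal, noise_a)]
y_b = [s + n for s, n in zip(signal, noise_b)]

ols_a = fit_two_predictors(x1, x2, y_a)
ols_b = fit_two_predictors(x1, x2, y_b)
ridge_a = fit_two_predictors(x1, x2, y_a, lam=0.1)
ridge_b = fit_two_predictors(x1, x2, y_b, lam=0.1)

print("least squares, noise A:", ols_a)
print("least squares, noise B:", ols_b)
print("ridge,         noise A:", ridge_a)
print("ridge,         noise B:", ridge_b)
```

With noise of size 0.01, the two least squares fits differ by tens of units and flip sign, while the two ridge fits stay within a small fraction of a unit of each other.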

jdeegan · December 6, 2011

:P I think I get it. If multicollinearity is your problem, ridge regression is a possible remedy. I always favored leaving out the surplus variable(s), or building a better model.

S2000magic · December 6, 2011

:P I think I get it. If multicollinearity is your problem, ridge regression is a possible remedy. I always favored leaving out the surplus variable(s), or building a better model.

That's so old-school; you probably also bid suits you have.

jdeegan · December 6, 2011

:P And I double for penalties whenever possible.

S2000magic · December 7, 2011

:P And I double for penalties whenever possible.

OK, now you're scaring me.

jdeegan · December 7, 2011

:P My wife, who is on the far side of 60, drives a Porsche Carrera rag top with a rear spoiler that deploys when you get over 70 mph. But, I would love a spin in your hi rev Honda.

WellSpyder · December 7, 2011

If multicollinearity is your problem, ... I always favored leaving out the surplus variable(s).

And which variable is that?

S2000magic · December 7, 2011

:P My wife, who is on the far side of 60, drives a Porsche Carrera rag top with a rear spoiler that deploys when you get over 70 mph. But, I would love a spin in your hi rev Honda.

I drove a 911S for many years (20,000 miles when I bought it, 180,000 miles when I sold it), and I can tell you hands down the S2000 is more fun to drive than the Porsche; and the Porsche was a blast to drive!

There is something sweet about a 9,000 RPM redline.

S2000magic · December 7, 2011

And which variable is that?

If they're sufficiently strongly correlated (positively or negatively), does it really matter which one(s) you drop?

helene_t · December 7, 2011

Much of the time, yes. However, you're decreasing variance by increasing bias

LASSO is not good at dealing with correlated predictors. If there are two strongly correlated predictors it may simply be impossible to determine which of the two is the causal one and which one works only through confounding with the other. In that case, the most robust thing you can do is to give each of them approximately equal influence. This is what RIDGE does. Stepwise AIC has the same problem as LASSO.
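
A small sketch of why this happens, using the extreme case where the two predictors are exact copies of each other. The helper name, the data, the penalty weight, and the candidate splits are all assumptions for illustration. With a fixed combined weight, the L1 penalty cannot tell the splits apart, whereas the L2 penalty is smallest when the weight is shared equally.

```python
def objective(x, y, b1, b2, lam, penalty):
    """SSE plus penalty for y ~ b1*x + b2*x, i.e. the same column entered twice."""
    sse = sum((t - (b1 + b2) * u) ** 2 for u, t in zip(x, y))
    if penalty == "l1":
        return sse + lam * (abs(b1) + abs(b2))
    return sse + lam * (b1 ** 2 + b2 ** 2)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
lam = 1.0
total = 2.0  # combined weight on the duplicated predictor (assumed)

# Different ways of dividing the same total weight between the two copies.
splits = [(total, 0.0), (total / 2, total / 2), (0.0, total), (1.5, 0.5)]

lasso_values = [objective(x, y, a, b, lam, "l1") for a, b in splits]
ridge_values = [objective(x, y, a, b, lam, "l2") for a, b in splits]

for (a, b), lv, rv in zip(splits, lasso_values, ridge_values):
    print(f"split ({a}, {b}): lasso objective {lv:.4f}, ridge objective {rv:.4f}")
```

Every split gives the same lasso objective, so which predictor survives is essentially arbitrary; the ridge objective is lowest at the equal split.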

So if your main concern is to deal correctly with correlated predictors, RIDGE is preferable to just about everything else, although I suppose the best thing to do would be to have a serious talk with the domain expert to try to get to a more advanced model that captures the domain knowledge better. For example, you might put an L2 (RIDGE) penalty on coefficients that belong to clusters of two or more correlated predictors, while putting an L1 (LASSO) penalty on the lonely riders. RIDGE and LASSO are somewhat ad hoc methods; they are the methods you will use when you have large data sets but shallow domain knowledge.

As for the bias, yes, but that is intentional: you apply biased estimators like RIDGE when the bias is a virtue. You have a prior belief that small coefficients are more plausible than large ones, so the posterior mean (or mode) must be smaller in magnitude than the unbiased estimate.

nige1 · December 7, 2011

Thank you Hrothgar. Loren Shure is brilliant at making this beautiful stuff more comprehensible to tyros like me. MATLAB seems powerful and succinct. A bit like APL, a language devised by Ken Iverson of IBM, popular in the 60s and 70s that works best with a special mathematical character-set. More recently, in rec.games.bridge, Charles Brenner uses APL to solve Bridge probability problems, without recourse to crude simulation.

hrothgar · December 7, 2011

Thank you Hrothgar. Loren Shure is brilliant at making this beautiful stuff more comprehensible to tyros like me.

Loren is, indeed, great.

However, I feel obliged to point out that those two articles (and all the code) were authored by moi...

mwalimu02 · December 11, 2011

Wow, nice blog post!
