I am using scikit-learn to train some regression models on data and noticed that the cost function for Lasso regression is defined like this:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

whereas the cost function for e.g. Ridge regression is shown as:

||y - Xw||^2_2 + alpha * ||w||^2_2

I had a look in the code (Lasso & Ridge) as well, and the implementations of the cost functions match the formulas above. I am confused why the 1/n_samples factor is only present in the Lasso regression case. From my perspective it makes sense to scale the residuals inversely proportional to the number of samples, so that when the algorithm is used on a dataset with more training samples the appropriate value of alpha is roughly invariant to the sample size. In the ElasticNet class, which can be understood as a combination of Lasso and Ridge regression, we also see that factor of 1/n_samples. Can someone explain why this factor is not present in the cost function of Ridge regression?

My related stackexchange question: here
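To make the invariance argument concrete, here is a minimal sketch (toy random data, arbitrary alpha values) that duplicates every training sample: because of the 1/n_samples factor, the Lasso objective is unchanged and the fitted coefficients stay the same, while for Ridge the penalty term does not grow with the data, so alpha has to be scaled along with n_samples to recover the same fit.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)

# Duplicate every sample: n_samples doubles, the data distribution is unchanged.
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])

# Lasso: the 1/(2 * n_samples) factor makes the objective, and hence the
# fitted coefficients, invariant to duplicating the data.
l1 = Lasso(alpha=0.1).fit(X, y)
l2 = Lasso(alpha=0.1).fit(X2, y2)
print(np.allclose(l1.coef_, l2.coef_, atol=1e-6))  # True (up to solver tolerance)

# Ridge: no 1/n_samples factor, so alpha must be scaled with n_samples
# (here doubled) to obtain the same coefficients on the duplicated data.
r1 = Ridge(alpha=1.0).fit(X, y)
r2 = Ridge(alpha=2.0).fit(X2, y2)
print(np.allclose(r1.coef_, r2.coef_))  # True
```

Fitting Ridge(alpha=1.0) on the duplicated data instead would give different coefficients, which is exactly the sample-size dependence the question is about.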